THE  DISTRIBUTION  OF  AN  ITEM  FIT  INDEX 


By 

CATHERINE  M.  HOMBO 


A  DISSERTATION  PRESENTED  TO  THE  GRADUATE  SCHOOL  OF  THE 
UNIVERSITY  OF  FLORIDA  IN  PARTIAL  FULFILLMENT  OF  THE 
REQUIREMENTS  FOR  THE  DEGREE  OF  DOCTOR  OF  PHILOSOPHY 

UNIVERSITY  OF  FLORIDA 

1998 


ACKNOWLEDGMENTS 


I  wish  to  thank  all  of  those  involved,  directly  or  indirectly,  in  supporting  me 
through  the  completion  of  this  work.  For  their  guidance,  suggestions,  comments, 
support,  and  patience,  bounteous  thanks  are  due  to  all  members  of  my 
committee,  especially  the  chairman,  Linda  Crocker.  It  is  because  of  Dr. 
Crocker's  inspiring  teaching  that  educational  measurement  changed  from  a 
required  course  to  an  abiding  interest  and  eventual  career  direction. 
Recognition  for  efforts  far  above  and  beyond  the  call  of  duty,  wearing  whichever 
hat  was  needed  at  the  moment,  is  offered  to  John  R.  Donoghue  of  Educational 
Testing  Service.  Dr.  Donoghue  always  provided  wisdom,  support,  criticism,  and 
humor  in  the  exact  proportions  required  at  the  moment.  His  insight  into  the 
methodological  thickets  into  which  I  wandered  always  helped  me  find  the  way 
out.  I  would  also  like  to  thank  Educational  Testing  Service  for  their  financial 
support,  both  through  the  Summer  Experience  in  Research  and  the  Warren  G. 
Willingham  GRE  Graduate  Assistantship.  The  opportunities  provided  by  these 
programs  allowed  my  interest  in  the  topic  of  this  research  to  flourish. 

Without  the  support  of  family  and  friends,  this  achievement  would  not 
have been possible. I would like to gratefully recognize my husband, Eric, who
always smoothes the path ahead, and my parents, Patricia and Guerry
McClellan,  who  invariably  provide  an  outstanding  example  simply  by  being 
themselves. 


TABLE OF CONTENTS

ACKNOWLEDGMENTS

ABSTRACT

CHAPTERS

I  INTRODUCTION
     Overview of Item Response Theory
     Detecting Item Misfit
     Statement of the Problem
     Questions and Hypotheses
     Delimitations of the Study
     Rationale for the Study

II  REVIEW OF RELATED LITERATURE
     Chi-Square Tests
     Item Fit Indices
     Procedure for Calculation of Scaling Factor

III  PROCEDURES, METHODS, AND RESULTS: PHASE 1
     Research Questions
     Methods and Procedures for Phase 1
     Results of Analysis for Phase 1
     Summary of Findings

IV  PROCEDURES, METHODS, AND RESULTS: PHASE 2
     Hypotheses
     Methods and Procedures for Phase 2
     Results of Regression Analyses
     Influential Observations
     Testing Significance of the Increases in R²
     Evaluating the Regression Models

V  DISCUSSION AND CONCLUSIONS
     Summary of Procedures and Findings
     Discussion
     Directions for Future Research

APPENDIX  CORRELATION MATRIX OF DEPENDENT AND INDEPENDENT VARIABLES

REFERENCES

BIOGRAPHICAL SKETCH


CHAPTER  I 
INTRODUCTION 


Overview  of  Item  Response  Theory 

Item  Response  Theory  (IRT)  has  evolved  into  a  useful  alternative  to  the 
classical  methods  of  test  scoring  and  item  analysis  (Bock,  1997).  The  theory 
relies  on  the  belief  that  examinees  possess  one  or  more  characteristics  that 
underlie  test  and  item  performance.  These  characteristics,  often  referred  to  as 
latent  traits,  are  not  directly  observable.  However,  item  responses  are 
observable  and  are  assumed  to  be  related  to  examinee  trait  level.  For  this 
reason,  examinee  performance  can  be  used  to  estimate  the  level  of  the  trait 
using  an  item  response  model.  In  IRT,  the  parameters  that  describe  the  item  are 
designed  to  be  independent  of  the  population  distribution  of  the  latent  trait  and 
the  parameters  that  describe  the  person  are  designed  to  be  independent  of  the 
particular  items  used  to  measure  the  trait,  given  the  availability  of  a  large  pool  of 
items  measuring  the  same  content  and  a  large  population  of  examinees.  Item 
response  models  specify  the  presumed  mathematical  relationship  between  the 
latent trait ($\theta$) and the observed performance on the item.

The  probability  of  a  correct  response  is  described  as  a  monotonically 
increasing,  non-linear  function  of  examinee  ability  in  a  model  referred  to  as  the 
normal  ogive.  The  two-parameter  normal  ogive  item  response  model  is 




$$P_i(\theta) = \int_{-\infty}^{a_i(\theta - b_i)} \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\, dz, \qquad (1\text{-}1)$$

where $i$ indexes the items, $P_i(\theta)$ is the probability that a randomly selected
examinee with ability $\theta$ answers item $i$ correctly, $a_i$ is proportional to the slope of
$P_i(\theta)$ at the point $\theta = b_i$, and $b_i$ is the point on the ability scale at which the curve
has  an  inflection  point,  changing  from  concave  up  to  concave  down.  Using  this 
model,  at  this  inflection  point  an  examinee  has  a  50%  chance  of  answering  the 
item  correctly.  This  model  was  one  of  the  first  to  be  applied  successfully  to  real 
test  data,  but  proved  difficult  to  calculate.  It  has  been  supplanted  largely  by  a 
logistic  model,  which  is  more  mathematically  tractable.  The  three-parameter 
logistic  model  is 

$$P_i(\theta) = c_i + (1 - c_i)\, \frac{e^{D a_i(\theta - b_i)}}{1 + e^{D a_i(\theta - b_i)}}, \qquad (1\text{-}2)$$

where $P_i(\theta)$, $a_i$, and $b_i$ are as above, $c_i$ is the lower asymptote of the item
characteristic curve, and $D$ is a constant, 1.7, used to transform the ability scale
of the logistic equation to a scale similar to that of the normal ogive model.
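
The two models above can be compared numerically. The following Python sketch is illustrative only: the function names `normal_ogive` and `icc_3pl` and the item parameter values are assumptions chosen for the example, not values from the study. It evaluates both item response functions and reports how far the logistic curve with D = 1.7 departs from the normal ogive when the lower asymptote is zero.

```python
import math

D = 1.7  # scaling constant named in the text


def normal_ogive(theta, a, b):
    """Two-parameter normal ogive: standard normal CDF at a * (theta - b)."""
    return 0.5 * (1.0 + math.erf(a * (theta - b) / math.sqrt(2.0)))


def icc_3pl(theta, a, b, c=0.0):
    """Three-parameter logistic model; c = 0 gives the two-parameter case."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


# Hypothetical item: fairly high discrimination, moderate difficulty,
# low pseudo-guessing parameter.
a, b, c = 1.5, 0.5, 0.1
grid = [x / 10.0 for x in range(-40, 41)]  # theta from -4.0 to 4.0
curve = [icc_3pl(t, a, b, c) for t in grid]

# Largest gap between the logistic (c = 0) and normal ogive curves on the grid.
max_gap = max(abs(icc_3pl(t, a, b) - normal_ogive(t, a, b)) for t in grid)
print(round(icc_3pl(b, a, b, c), 3), round(max_gap, 4))
```

For an item with a nonzero lower asymptote, the probability of a correct response at $\theta = b_i$ is $(1 + c_i)/2$ rather than 0.5, which the first printed value reflects.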

The  plot  of  the  probability  of  a  correct  response  to  the  item,  plotted 
on the vertical axis, versus ability, designated theta ($\theta$) and plotted on the
horizontal  axis,  is  referred  to  as  the  item  characteristic  curve  (ICC).  An  example 
of  an  ICC  for  an  item  with  fairly  high  discrimination,  moderate  difficulty  and  a  low 
pseudo-guessing  parameter  is  shown  in  Figure  1-1.  The  range  of  ability  values 



in  the  plot  is  from  -4.0  to  4.0,  which  should  include  the  ability  levels  of  nearly  all 
examinees  in  the  population. 

In  classical  test  theory,  items  have  two  main  characteristics  of  interest, 
difficulty  and  discrimination  (Crocker  &  Algina,  1986).  Both  the  normal  ogive  and 
logistic  models  use  the  item  characteristics  difficulty  (denoted  b)  and 
discrimination  (denoted  a)  as  parameters  defining  the  relationship  between  the 
latent  trait  and  the  probability  of  a  correct  response,  although  the  methods  of 
calculation,  range  of  values,  and  interpretation  of  the  values  differ  from  those  in 
classical  test  theory.  Typically,  values  for  item  difficulty  parameter  estimates 
range  from  -2.0  to  2.0,  with  lower  values  representing  easier  items,  although 
item  difficulties  outside  this  range  do  occur.  Because  negatively  discriminating 
items  are  usually  discarded,  the  typical  range  of  discrimination  parameter 
estimates  used  in  ability  tests  is  from  0  to  2.0.  Items  with  high  discrimination 
values  have  item  characteristic  curves  that  increase  rapidly  as  a  function  of 
ability,  changing  from  a  low  probability  to  a  high  probability  of  correct  response 
in  a  narrow  range  of  ability.  When  the  logistic  model  is  applied  to  items  on 
which  examinees  have  a  substantial  probability  of  randomly  obtaining  the  correct 
response,  such  as  multiple  choice,  a  third  parameter  (denoted  c)  representing 
this  probability  is  included. 




Item  response  theory  provides  another  item  property  called  information. 
The  information  function  used  in  this  study,  the  Fisher  information  function,  is 
defined as

$$I_i(\theta) = \frac{[P_i'(\theta)]^2}{P_i(\theta)\,[1 - P_i(\theta)]},$$

where $i$ indexes items, $P_i(\theta)$ is the function given in equation 1-2 and $P_i'(\theta)$ is the
first derivative of the function (van der Linden & Hambleton, 1997). Examinees
with  low  ability  responding  to  items  with  very  high  difficulty  have  very  little 
probability  of  a  correct  response,  and  for  such  examinees  such  an  item  provides 
little  or  no  information  about  the  examinee's  ability.  The  situation  is  similar  for 
items  with  very  low  difficulty  and  examinees  with  very  high  ability.  These 
examinees  almost  always  respond  correctly  to  such  items,  so  very  little 
information  is  provided  about  their  ability  by  such  items.  Items  that  are 
challenging  but  not  impossibly  difficult  for  a  particular  examinee  provide  the  most 
information  about  that  examinee's  ability.  Item  information  depends  on  the  slope 
of  the  item  response  function  and  the  conditional  variance  at  each  theta  level. 
An  item  with  small  response  variance  and  a  steep  ICC  slope  has  a  small 
standard  error  of  measurement  and  provides  a  large  amount  of  information 
about  an  examinee  whose  ability  estimate  is  at  or  near  the  maximum  of  the 
information function. A plot of an item information function appears in Figure 1-1,
with the ICC for the same item also shown. The item information function for



this  item,  as  for  most  item  information  functions,  is  bell-shaped.  This  item 
provides  a  substantial  amount  of  information  about  examinees  with  theta 
estimates  between  -0.5  and  2.0,  and  very  little  information  for  examinees  outside 
this  range. 
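
A short numerical sketch may help make the relationship between slope, conditional variance, and information concrete. It assumes the usual form of Fisher information for a dichotomous item, the squared slope of the item response function divided by the conditional variance P(1 - P), applied to the three-parameter logistic model. The function names and parameter values are hypothetical.

```python
import math

D = 1.7  # scaling constant named in the text


def icc_3pl(theta, a, b, c=0.0):
    """Three-parameter logistic item response function."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def icc_slope(theta, a, b, c=0.0):
    """First derivative of the three-parameter logistic function."""
    logistic = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    return D * a * (1.0 - c) * logistic * (1.0 - logistic)


def item_information(theta, a, b, c=0.0):
    """Squared slope divided by the conditional variance P(1 - P)."""
    p = icc_3pl(theta, a, b, c)
    return icc_slope(theta, a, b, c) ** 2 / (p * (1.0 - p))


grid = [x / 10.0 for x in range(-40, 41)]  # theta from -4.0 to 4.0

# Ability level on the grid at which each hypothetical item is most informative.
peak_2pl = max(grid, key=lambda t: item_information(t, 1.0, 0.0))
peak_3pl = max(grid, key=lambda t: item_information(t, 1.0, 0.0, 0.2))
print(peak_2pl, peak_3pl)
```

In this sketch a nonzero lower asymptote lowers the information at every ability level and moves the peak slightly above the difficulty parameter, while the item without guessing peaks exactly at its difficulty.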

Detecting  Item  Misfit 
Using  Point  Estimates  of  Ability 

Examinee  response  data  never  conform  exactly  to  the  expected  values 
generated  by  the  model.  Some  deviation  from  the  model  is  to  be  expected  and 
can  be  tolerated,  but  deviations  can  be  substantial  and  troubling  to  the 
assumption  that  the  IRT  model  acts  as  an  accurate  descriptor  of  the  data.  A  plot 
illustrating  item  response  data  which  are  well  fit  by  an  item  response  model  is 
shown  in  Figure  1-2,  and  a  plot  illustrating  poor  model-data  fit  is  shown  in  Figure 
1-3. 

Tests  of  model-data  fit  for  items,  referred  to  as  item  fit  indices,  have  been 
developed to evaluate whether the misfit between the observed data and the
expected values generated from the model is statistically significant and
requires some action on the part of the test developer. Statistical tests rely on
some  reference  distribution  of  values,  consisting  of  a  sample  of  the  values 
assumed  to  occur  in  the  population  of  interest,  to  determine  the  significance  of 
the  test  value. 




A  fit  index  for  an  item  indicates  a  statistically  significant  degree  of  misfit  between 
the  observed  data  and  the  expected  values  provided  by  the  model  if  the 
calculated  value  is  larger  than  some  specified  value  from  the  reference 
distribution.  In  evaluating  model-data  fit,  the  chi-square  family  of  distributions 
generally  is  chosen  as  the  reference  distribution  of  values. 

Calculation of the chi-square statistic typically requires that each variable
be measured on a categorical scale or on an interval scale that has been divided
into discrete categories on the scale continuum. The observed number of
responses or values in each category is compared to an expected number of
observations  for  that  category.  The  expected  values  may  be  derived  from 
previous  information  about  the  variable  or  some  assumed  characteristics  of  the 
data,  such  as  an  assumption  that  the  data  are  normally  distributed  in  the 
population  of  interest  or  that  the  observations  fall  into  categories  by  chance. 
Such  counts  of  the  expected  and  observed  frequencies  may  be  collected  into  a 
cross-tabulation  referred  to  as  a  contingency  table,  with  the  data  categories 
forming  the  row  and  column  headings  of  the  table.  A  chi-square  statistic  value 
is  calculated  as  follows 

$$\chi^2 = \sum_{\text{categories}} \frac{(O - E)^2}{E}, \qquad (1\text{-}3)$$

where $O$ represents the observed frequency and $E$ represents the expected
frequency  of  the  data  in  each  category.  The  result  of  this  calculation  is  assumed 
to  be  a  value  drawn  from  a  member  of  the  chi-square  family  of  distributions,  and 



is  compared  to  the  appropriate  critical  value  to  determine  the  significance  of  the 
statistical  test. 

Use  of  chi-square  distributions  requires  data  that  meet  certain 
assumptions.  One  critical  assumption  is  that  each  observation  must  fall  into  one 
and  only  one  category.  In  tests  of  item  fit,  one  of  the  variables,  ability,  is 
continuous,  so  the  theta  range  being  considered  in  the  study  must  be  partitioned 
into discrete intervals to create the cross-tabulation needed for use of a chi-square
test. The row headings of the contingency table are the intervals defined
in  the  theta  distribution  and  the  column  headings  are  the  number  of  observed 
and  expected  responses.  The  interval  into  which  each  observed  examinee 
ability  estimate  falls  is  determined,  and  the  frequency  count  of  correct  or 
incorrect  responses  for  each  interval  is  placed  in  the  appropriate  cell  of  the 
contingency  table.  The  expected  frequency  count  for  each  ability  interval  is 
obtained  from  the  IRT  model  using  a  representative  value  from  the  interval,  and 
the two frequency distributions are compared. An example of such a table is
shown in Table 1-1.

Use  of  the  chi-square  distribution  as  a  reference  distribution  implies  that 
the  value  being  tested,  when  collected  over  many  data  points  to  form  a  sampling 
distribution  of  values,  would  be  very  similar  in  form  and  properties  to  the  member 
of  the  chi-square  family  being  used  as  the  reference  distribution.  The  reference 
distribution  is  chosen  based  on  the  number  of  cells  in  the  contingency  table  that 
can  be  freely  specified.  This  number  is  called  the  degrees  of  freedom  (denoted 



df),  and  equals  the  mean  of  the  chi-square  distribution  (Pearson,  1900,  as  cited 
in  Cochran,  1952).  The  shape  of  the  chi-square  distributions  varies  sharply  as 
the  degrees  of  freedom  increase;  extreme  positive  skewness  is  apparent  at  low 
df values. The distribution's shape approaches that of a normal distribution as
the  df  value  increases.  The  chi-square  reference  distribution  is  used  in  tests  of 
fit  to  determine  if  the  deviations  between  the  observed  sampling  distribution  and 
a  sampling  distribution  of  theoretically  expected  values  are  large  enough  to  be 
considered statistically significant, and whether the observed deviations could
be due to chance.

The  formula  for  determining  the  degrees  of  freedom  for  the  item  fit 
statistic  value,  and  thus  for  selecting  the  appropriate  reference  distribution  for 
the  test  of  significance,  is 

$$df = (r \times c) - m, \qquad (1\text{-}4)$$

where  r  is  the  number  of  ability  intervals,  c  is  the  number  of  score  categories, 
and  m  is  the  number  of  item  parameters  estimated  from  the  observed  data.  The 
adjustment  for  the  number  of  item  parameters  is  not  consistently  used,  and  is 
absent  from  the  item  fit  statistic  degrees  of  freedom  calculated  in  some  item 
analysis  and  test  scoring  software  programs,  for  example  BILOG  (Mislevy  & 
Bock, 1990). In Table 1-1, the number of ability intervals is 10, the number of
score categories is 2 for a dichotomously-scored item, and the model has three
parameters estimated from the data. Using the formula, the degrees of freedom
for the test is (2)(10) - 3, or 17. The observed $\chi^2$ value of 39.905 is compared to



the critical value of the $\chi^2$ reference distribution with 17 degrees of freedom at a
significance level of 0.05, which is 27.587. The observed value is higher,
indicating  significant  misfit  to  the  model. 
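
The arithmetic behind this worked example can be checked with a few lines of Python. The sketch below uses the observed and expected frequencies from Table 1-1; the function names are assumptions introduced for illustration. The computed statistic differs from the tabled total of 39.905 only in the third decimal place, because the table sums cell values that were rounded first.

```python
# Frequencies of correct responses from Table 1-1, one entry per theta interval.
observed = [3, 3, 9, 27, 55, 32, 35, 21, 14, 1]
expected = [1, 5, 11, 15, 33, 52, 41, 22, 17, 3]


def pearson_chi_square(observed, expected):
    """Sum of (O - E)^2 / E over the categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def item_fit_df(n_intervals, n_categories, n_params):
    """Degrees of freedom: (r x c) - m."""
    return n_intervals * n_categories - n_params


chi_sq = pearson_chi_square(observed, expected)
df = item_fit_df(n_intervals=10, n_categories=2, n_params=3)
print(round(chi_sq, 3), df)
```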


Table 1-1: Sample frequency cross-tabulation

Theta Interval    Observed Frequency -    Expected Frequency -    (O - E)^2 / E
                  Correct Response        Correct Response

[-4.0, -3.2)              3                       1                   4
[-3.2, -2.4)              3                       5                   0.8
[-2.4, -1.6)              9                      11                   0.36
[-1.6, -0.8)             27                      15                   9.6
[-0.8, 0)                55                      33                  14.67
[0, 0.8)                 32                      52                   7.69
[0.8, 1.6)               35                      41                   0.88
[1.6, 2.4)               21                      22                   0.045
[2.4, 3.2)               14                      17                   0.53
[3.2, 4.0]                1                       3                   1.33

Total                   200                     200                  39.905


The  choice  of  the  theta  range  of  interest  and  the  number  of  intervals  into 
which  the  theta  range  is  partitioned  is  often  decided  by  the  researcher.  Various 
theta  ranges  and  choices  of  number  of  intervals  are  reported  in  the  literature  and 
the default values provided by the most common IRT software programs are not
equal. BILOG uses a theta value range of -2.0 to 2.0 with 11 quadrature points
as its default values, resulting in 10 intervals with an interval width of 0.4.
PARSCALE (Muraki & Bock, 1996) uses a default theta value range of -4.0 to 4.0
with  30  quadrature  points,  resulting  in  29  intervals  with  a  width  of  0.267.  The 
default  values  used  by  MULTILOG  (Thissen,  1991)  are  not  specified  in  the 
manual.  User-specified  values  are  accepted  by  all  three  programs.  The  choice 
of  quadrature  points  is  often  made  with  a  view  toward  efficiency  in  estimating  the 
item  parameters  from  the  examinee  response  data.  The  specified  number  of 
theta  intervals  is  used  throughout  the  program  run;  however,  choices  that  are 
optimal  for  item  parameter  estimation  may  not  be  effective  for  the  calculation  of 
item  fit  statistic  values. 
Using  Pseudocounts 

In  most  tests  of  item  fit  cited  in  the  literature,  examinee  ability  estimates 
are  treated  as  accurate  point  values,  although  such  estimates  are  known  to 
contain  some  error  of  measurement.  An  important  modification  to  the  point- 
estimate  method  for  calculation  of  the  observed  data  frequency  distribution  has 
been  suggested  (Donoghue,  1998;  Stone,  Ankenmann,  Lane,  &  Liu,  1993; 



Stone,  Mislevy  &  Mazzeo,  1994)  which  attempts  to  account  for  this  error  rather 
than  ignore  it. 

Instead  of  assuming  that  point  estimates  of  examinee  ability  are  perfectly 
accurate,  an  approximation  of  the  examinee  ability  distribution  can  be 
constructed  by  dividing  the  continuum  of  ability  into  discrete  categories  with 
specified  width  and  estimating  the  probability  that  each  individual  examinee  has 
ability level $\theta$, based on the response pattern and an assumed marginal $\theta$
distribution. As a result, instead of being assigned to a single ability level, each
examinee is represented by a distribution of probabilities across several levels of
the theta scale. A posterior distribution of $\theta$ is produced by summing the
probabilities  within  each  ability  level  to  form  a  "pseudocount"  of  examinees  at 
each  level. 

IRT  models  assume  that  the  amount  of  error  in  examinee  ability  estimates 
varies  across  the  ability  scale  and  permit  estimation  of  this  error.  The 
calculation  of  item  fit  statistics  using  pseudocounts,  which  incorporates  data 
provided  by  these  errors  of  estimation,  should  result  in  a  refined  measure  of  item 
misfit,  but  this  has  not  been  demonstrated  conclusively. 

The  "observed  pseudocount"  of  the  number  of  examinees  in  each  ability 
interval  is  defined  as  follows 

$$O(q) = \sum_{n=1}^{N} f_n(q)\, i_n, \qquad (1\text{-}5)$$

where $n$ indexes the $N$ examinees, $f_n(q)$ represents the examinee's posterior
density at ability category $q$, and $i_n$ represents the item response. When items



are scored dichotomously, $i_n$ is either 1, if the examinee gave a correct response,
or 0, for an incorrect response. Items to which the examinee has not responded
are treated in likelihood ratio calculations and in the pseudocounts as
fractionally correct responses rather than incorrect responses. In this case, $i_n$ is
defined as $1/k$, where $k$ is the number of response options. The "expected"
number of examinees at each ability level is defined as

$$E(q) = \sum_{n=1}^{N} f_n(q)\, p(q), \qquad (1\text{-}6)$$

with  common  variables  as  defined  above  and  where  p(q)  represents  the 
estimated  probability  of  observing  a  correct  response  to  the  item  for  examinees 
at  ability  level  q  obtained  from  the  IRT  model.  In  both  cases,  the  frequency 
distribution  is  composed  of  the  results  of  the  calculation  for  each  ability  level. 
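
The pseudocount calculations can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the software used in the study: each examinee's posterior over the ability levels is taken as given, the expected count is assumed to weight each posterior density by the model probability p(q), and the function name and the three-examinee data are hypothetical.

```python
def pseudocounts(posteriors, responses, p_correct):
    """Observed and expected pseudocounts of correct responses per ability level.

    posteriors[n][q] -- posterior probability of examinee n at ability level q
    responses[n]     -- 1 (correct), 0 (incorrect), or 1/k for an omitted item
    p_correct[q]     -- model probability of a correct response at level q
    """
    n_levels = len(p_correct)
    observed = [0.0] * n_levels
    expected = [0.0] * n_levels
    for f_n, i_n in zip(posteriors, responses):
        for q in range(n_levels):
            observed[q] += f_n[q] * i_n           # posterior weight times response
            expected[q] += f_n[q] * p_correct[q]  # posterior weight times p(q)
    return observed, expected


# Hypothetical data: three examinees, three ability levels.
posteriors = [
    [0.7, 0.3, 0.0],
    [0.1, 0.6, 0.3],
    [0.0, 0.2, 0.8],
]
responses = [0, 1, 1]
p_correct = [0.2, 0.5, 0.8]

observed, expected = pseudocounts(posteriors, responses, p_correct)
print(observed, expected)
```

When every examinee's posterior places all of its mass on a single level, the observed pseudocounts reduce to the ordinary counts of correct responses used in the point-estimate method.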

After  obtaining  the  observed  frequencies  and  predicted  frequency 
distributions  as  described,  a  likelihood  ratio  goodness-of-fit  statistic  can  be 
calculated  by  comparison  of  frequencies  at  each  level.  The  formula  for  the 
statistic  is 


(1-7) 


where $q$ indexes the ability intervals and $i$ indexes the items. Likelihood ratio
chi-square  values  tend  to  be  larger  than  Pearson  chi-square  values,  and  thus 
provide  a  more  stringent  test  of  fit  (Bock  &  Lieberman,  1970). 
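
As a rough illustration of how such a statistic is computed, the sketch below implements one common form of the likelihood ratio goodness-of-fit statistic for a dichotomous item: twice the sum, over ability intervals and over the correct and incorrect cells, of the observed count times the log of observed over expected. This form, the function name, and the example counts are assumptions made for illustration and may differ in detail from the formula used in the study.

```python
import math


def likelihood_ratio_g2(observed, expected, totals):
    """Likelihood ratio statistic over ability intervals (assumed form).

    observed[q] -- observed (pseudo)count of correct responses in interval q
    expected[q] -- expected count of correct responses in interval q (must be > 0
                   and < totals[q])
    totals[q]   -- total (pseudo)count of examinees in interval q
    """
    g2 = 0.0
    for o, e, n in zip(observed, expected, totals):
        # Correct-response cell, then incorrect-response cell.
        for obs_count, exp_count in ((o, e), (n - o, n - e)):
            if obs_count > 0:  # a cell with zero observations contributes nothing
                g2 += 2.0 * obs_count * math.log(obs_count / exp_count)
    return g2


# Hypothetical counts for two ability intervals.
observed = [10.0, 20.0]
expected = [12.0, 18.0]
totals = [30.0, 30.0]

g2 = likelihood_ratio_g2(observed, expected, totals)
print(round(g2, 3))
```

The statistic is zero when the observed and expected counts agree in every cell and grows as the two frequency distributions diverge.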



Statement  of  the  Problem 
Misfit  and  the  Reference  Distribution 

Unfortunately,  using  pseudocounts  in  calculation  of  the  item  fit  index  has 
resulted  in  failure  to  detect  significant  misfit,  even  when  the  data  have  been 
simulated not to fit the model. Allen, Kline, and Zelenak (1996); Stone,
Ankenmann, Lane, and Liu (1993); and Stone, Mislevy and Mazzeo (1994) have
suggested that a likely cause of the failure is the use of an inappropriate
chi-square reference distribution, resulting from the use of the traditional algorithm
for df calculation. In using a family of reference distributions such as the
chi-square to determine statistical significance of item misfit, it is important to use
the correct member of the family. As the members vary in mean (i.e., degrees of
freedom),  critical  values  obtained  from  distributions  at  a  fixed  alpha  level  will 
differ.  Use  of  an  inappropriate  chi-square  distribution  as  the  reference 
distribution  could  result  in  incorrect  decisions  about  the  extent  of  item  misfit. 

A  second  possible  contributing  factor  to  the  problem  of  failure  to  detect 
misfit  in  this  situation  is  the  effect  of  a  scaling  factor  needed  to  decompress  the 
values  of  the  fit  statistics.  Because  use  of  pseudocounts  is  a  violation  of  the 
assumption  of  the  independence  of  observations  within  the  data  set,  a  possible 
effect  would  be  a  reduction  of  variance  in  the  fit  statistic  values,  resulting  in  a 
distribution  which  is  approximately  the  correct  shape  but  compressed.  To 
overcome  the  effects  of  such  compression,  a  procedure  described  in  Chambers, 
Cleveland,  Kleiner,  and  Tukey  (1983)  may  be  applied  to  determine  the  degree  of 



the  compression  and  rescale  the  data.  Even  when  the  rescaling  procedure  has 
been  used  (Allen,  Kline,  &  Zelenak,  1996;  Stone,  Ankenmann,  Lane,  &  Liu, 
1993;  Stone,  Mislevy  &  Mazzeo,  1994),  the  resulting  fit  statistic  sampling 
distribution  does  not  conform  well  to  the  reference  distribution  with  the  degrees 
of  freedom  calculated  using  the  traditional  algorithm.  The  role  of  the  scaling 
factor  in  this  observed  lack  of  conformance  between  theoretical  expectation  and 
empirical  observations  has  not  been  examined  thoroughly. 

Finally,  there  is  the  sobering  possibility  that  the  use  of  pseudocounts  to 
generate  the  fit  index  values  results  in  a  fit  statistic  with  sampling  distribution 
that  no  longer  has  the  shape  of  a  chi-square  distribution.  Thus,  it  seemed 
desirable  to  know  more  about  (a)  the  shape  of  the  item  fit  index  sampling 
distribution  when  pseudocounts  are  used,  and  (b)  the  "effective  degrees  of 
freedom" for the reference distribution, assuming that a member of the chi-square
family could be identified.
Definition  of  Effective  Degrees  of  Freedom 

The  "effective  degrees  of  freedom"  is  the  optimal  value  for  degrees  of 
freedom  of  the  appropriate  reference  chi-square  distribution.  Only  a  simulated 
data  set  would  allow  direct  estimation  of  the  effective  degrees  of  freedom  when 
item  fit  statistic  values  are  calculated,  using  the  pseudocounts  procedure,  from 
data  simulated  to  fit  the  item  response  model.  Obviously,  direct  estimation  of  the 
effective  degrees  of  freedom  is  impossible  for  practitioners,  who  must  make 
decisions  about  item  fit  based  on  empirical  data,  usually  a  single  fit  statistic 



value  for  each  item.  Even  if  a  sampling  distribution  of  fit  statistic  values  were 
available  for  the  individual  examinee  responses,  the  values  of  the  sample 
moments  could  not  be  used  to  estimate  the  effective  degrees  of  freedom  and 
scaling  factor,  as  the  question  of  whether  or  not  the  data  fit  the  model  is 
unresolved.  For  this  reason,  it  would  be  useful  to  have  some  method  of 
determining  the  effective  degrees  of  freedom  and  an  appropriate  scaling  factor 
using  information  on  test  items  that  is  available  in  typical  testing  situations. 
Source  of  the  Data 

The  secondary  data  analyses  performed  in  the  study  were  applied  to  a 
large-scale  simulated  data  set  created  by  Isham  and  Donoghue  (1995)  at 
Educational  Testing  Service.  These  researchers  generated  the  data  set  using 
item  parameters  drawn  randomly  without  replacement  from  item  parameter 
estimates  obtained  in  the  1992  National  Assessment  of  Educational  Progress 
(NAEP)  Reading  and  Mathematics  Assessments.  Two  test  lengths  were 
simulated,  30  and  60  items.  The  item  parameters  were  selected  so  that  20%  of 
each  test  was  composed  of  items  described  by  the  two-parameter  logistic  model, 
with  the  remaining  items  described  by  the  three-parameter  logistic  model.  The 
items  comprising  the  30-item  test  were  used,  along  with  30  new  items,  to  create 
the 60-item test. For each test, item responses were generated for 1000
simulated examinees. This process was replicated 1000 times.

For  each  simulee  within  each  replication,  the  posterior  probability  at  each 
ability  level  had  been  calculated,  and  the  pseudocounts  were  used  to  form  the 



observed  and  expected  distributions.  The  likelihood  ratio  item  fit  statistic  value 
also  had  been  calculated  from  comparison  of  the  distributions.  As  the  use  of 
pseudocounts alters the traditional calculation of the test of item fit, software
programs  that  incorporate  the  necessary  calculations  are  not  generally 
available.  However,  the  NAEP-modified  version  of  BILOG/PARSCALE  was 
revised  at  ETS  to  use  pseudocounts  in  forming  observed  and  expected 
distributions.  These  values  have  not  previously  been  available  for  examination 
on  any  large-scale  empirical  or  simulated  data  set,  as  they  do  not  form  part  of 
the  standard  program  output.  This  simulated  data  set  and  the  output  from  the 
modified  BILOG/PARSCALE  program  provided  an  ideal  source  of  data  for  the 
present  study. 
Purpose 

One  purpose  of  this  study,  using  simulated  data,  was  to  determine 
whether  the  effective  degrees  of  freedom,  which  are  not  generally  available  in 
real  testing  situations,  could  be  estimated  as  a  function  of  more  readily  available 
item  or  test  characteristics.  Another  purpose  of  this  study  was  to  determine 
whether  a  scaling  factor,  used  to  adjust  for  possible  compression  in  fit  statistic 
distribution,  could  be  estimated  from  item  or  test  characteristics.  As  the  effective 
degrees  of  freedom  and  scaling  factor  acted  as  dependent  variables  in  this 
study,  the  anticipated  properties  of  the  fit  statistic  distributions  obtained  from  the 
simulated  data  set  were  verified  first.  This  was  done  in  phase  1  of  the  study.  An 
overview of the ordering of tasks in phase 1 is provided in the flowchart shown in
Figure 1-4. The estimation procedures were carried out in phase 2 of the study.
An  overview  of  the  ordering  of  tasks  in  phase  2  is  provided  in  the  flowchart 
shown  in  Figure  1-5. 

Using  item  response  theory  models,  items  have  defined  characteristics, 
including  the  item  parameters  and  amount  of  information  items  provide.  One 
aspect  of  this  study  was  to  determine  if  the  item  characteristics  of  difficulty, 
discrimination,  and  guessing  have  a  statistically  significant  role  in  predicting  the 
scaling  factor  and  effective  degrees  of  freedom.  Also,  the  role  of  the  maximum 
value  and  the  variance  of  the  item  information  function  in  predicting  these 
variables  was  examined. 

Because  use  of  pseudocounts  creates  dependence  between  ability 
intervals  used  in  calculation  of  the  item  fit  statistic  values,  the  procedure  was 
suspect  as  a  cause  of  the  discrepancy  between  the  effective  and  expected 
degrees  of  freedom.  A  possible  solution  to  this  problem  was  to  vary  the 
bandwidth  of  the  theta  interval  occupied  by  an  examinee  to  restore,  to  some 
extent,  the  independence  between  ability  intervals.  Thus,  an  additional 
independent  variable  examined  in  this  study  was  the  number  of  "effectively 
independent  intervals"  required  to  encompass  a  theta  range  of  -4.0  to  4.0  for 
various  confidence  levels.  To  determine  this  number,  the  point  estimate  of  an 
examinee's theta was expanded to include confidence intervals, ranging in
width  from  50%  to  99%,  about  the  value,  creating  an  interval  that  was  assumed 
to  be  effectively  independent  from  other  intervals  created  in  the  distribution. 


[Figure 1-4: flowchart of the ordering of tasks in phase 1]

[Figure 1-5: flowchart of the ordering of tasks in phase 2]

The  various  confidence  levels  were  calculated  to  determine  whether  a 
"sufficient"  level  of  restored  independence  exists. 

Intervals  were  created  to  encompass  the  distribution  of  interest.  The 
number  of  effectively  independent  intervals  created  across  the  theta  distribution 
was used as a predictor of the effective degrees of freedom and scaling factor
needed  to  decompress  the  item  fit  statistic  values. 
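
One possible reading of the interval construction described above can be sketched in Python. The sketch assumes, purely for illustration, that every examinee shares a single standard error of measurement and that the interval is a normal-theory confidence interval; the function name `n_independent_intervals` and the value `se=0.3` are hypothetical and are not taken from the study.

```python
import math
from statistics import NormalDist


def n_independent_intervals(se, confidence, lo=-4.0, hi=4.0):
    """Count the confidence-interval-wide bins needed to span [lo, hi].

    Assumption (not from the study): all examinees have the same standard
    error `se`, and each interval is theta_hat +/- z * se.
    """
    # Two-sided critical value for the requested confidence level.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    width = 2.0 * z * se  # full width of one interval on the theta scale
    return math.ceil((hi - lo) / width)


# Wider (higher-confidence) intervals leave fewer intervals across the range.
for level in (0.50, 0.80, 0.95, 0.99):
    print(level, n_independent_intervals(se=0.3, confidence=level))
```

Under these assumptions, raising the confidence level widens each interval and therefore lowers the count of intervals that fit between -4.0 and 4.0.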

Questions  and  Hypotheses 

Phase  1 

The  following  research  questions  were  explored  in  phase  1  of  the  study  for  the 
distribution  of  the  item  fit  statistics: 

Question 1: Are the distributions of the item fit statistics members of the
chi-square family?

Question 2: Are the distributions of the item fit statistics compressed?

Question 3: After decompression, do the distributions of the item fit statistics
conform to the chi-square reference distribution with the effective degrees of
freedom?

Phase 2

The  following  hypotheses  were  evaluated  in  phase  2  of  the  study  for  the 
distribution  of  the  item  fit  statistics: 

H11: The item parameters will be a statistically significant predictor of the
scaling factor calculated from the sample moments of the distribution.

H12: The item parameters will be a statistically significant predictor of the
effective degrees of freedom calculated from the sample moments of the
distribution.

H13: The maximum value of the item information function will be a statistically
significant predictor of the effective degrees of freedom calculated from the
sample moments of the distribution.

H14: The variance of a sample of item information values will be a statistically
significant predictor of the effective degrees of freedom calculated from the
sample moments of the distribution.

H15: The maximum value of the item information function will be a statistically
significant predictor of the scaling factor used to decompress the data calculated
from the sample moments of the distribution.

H16: The variance of a sample of item information values will be a statistically
significant predictor of the scaling factor used to decompress the data calculated
from the sample moments of the distribution.

H17: The number of effectively independent intervals will be a statistically
significant predictor of the effective degrees of freedom calculated from the
sample moments of the distribution.

H18:  The  number  of  effectively  independent  intervals  will  be  a  statistically 
significant  predictor  of  the  scaling  factor  used  to  decompress  the  data  calculated 
from  the  sample  moments  of  the  distribution. 


Delimitations  of  the  Study 

All  item  parameters  were  drawn  from  the  1992  National  Assessment  of 
Educational  Progress  (NAEP)  Reading  and  Mathematics  Assessments.  These 
parameters  may  not  represent  all  item  parameter  combinations  of  interest  to 
practitioners.  For  example,  few  items  with  extreme  parameter  values  appeared 
in  the  data,  but  such  items  may  be  of  interest  in  other  situations. 

Estimates  of  the  item  parameters,  based  on  the  simulated  item  responses, 
were  used  in  computing  the  posterior  distributions  of  the  individual  examinees. 
However,  true  item  parameters  were  used  in  all  analyses  in  this  study,  and  the 
effects  of  using  estimated  item  parameters  in  the  specific  calculations  performed 
in  this  study  were  not  considered.  This  study  was  further  restricted  to  tests  of 
length  30  or  60  items  drawn  randomly  from  a  longer  examination,  with  simulated 
sample  sizes  of  1000.  Thus,  the  results  may  not  generalize  to  items  with 
parameters  very  different  from  those  selected  for  analysis  here,  or  to  samples  of 
other  sizes,  or  to  tests  of  different  lengths. 

It  was  assumed  also  that  the  item  parameters  sampled  from  the  1992 
National  Assessment  of  Educational  Progress  (NAEP)  Reading  and  Mathematics 
Assessments  are  representative  of  item  parameters  encountered  in  many 
practical  large-scale  testing  situations.  A  further  assumption  was  that  the 
examinee  data  were  simulated  to  represent  a  population  in  which  the  trait  being 
measured  is  normally  distributed.  The  effects  of  using  skewed  or  non-central 
samples were not investigated. The trait measured by the data was assumed to
be unidimensional, which means that performance on the test reflects a
single ability, and the responses were assumed to have been scored
dichotomously.

Rationale for the Study

Determination of item fit to the presumed theoretical model plays an
important  role  in  the  development  of  high-quality  assessments.  The  inability  to 
rely  on  the  statistical  significance  of  tests  of  item  misfit  may  support  poor 
conclusions  about  item  properties.  Test  designers  and  developers  do  not  have 
complete  information  on  which  to  base  decisions  about  item  inclusion,  exclusion, 
or  revision,  and  may  be  led  to  make  poor  choices  as  a  result.  This  has  the 
potential  to  lower  the  validity  with  which  high-stakes  decisions  are  made  about 
examinees. 

Use  of  a  theoretical  asymptotic  distribution  in  determining  significant  item 
misfit  is  ineffective  if  the  nominal  distribution  being  used  is  not  a  good 
approximation of the actual empirical distribution of the statistic. If the effective
degrees of freedom are considerably fewer than the nominal value, then determinations of
significance of item misfit based on the statistical tests are made incorrectly. An
efficient  and  practical  way  to  determine  the  effective  distribution  for  testing  the 
significance  of  item  misfit  from  information  available  in  a  data  set  of  actual 
examinee  responses  would  substantially  increase  the  validity  of  decisions  about 
item  fit  based  on  statistical  tests  of  significance. 


To  increase  the  utility  of  the  item  fit  indices  calculated  by  parameter 
estimation  software  programs,  the  underlying  distribution  must  be  better 
understood.  The  properties  of  sampling  distributions  drawn  from  studies  using 
both  actual  and  simulated  data  suggest  that  the  values  comprising  the  item  fit 
statistic  distribution  are  compressed  to  some  degree  and  that  the  mean  values 
are  not  as  large  as  expected.  This  has  precluded  general  reliance  on  the 
statistical  significance  of  item  misfit  as  a  flag  for  item  review  or  removal  in 
practical  testing  situations.  A  reliable  method  for  adjusting  the  fit  statistic  values 
to  remove  the  compression  and  for  determining  the  appropriate  distribution  to 
evaluate  significance  of  item  misfit  would  increase  the  confidence  of  test  users 
and  developers  in  the  value  of  the  procedure.  Such  confidence  will  be 
necessary  for  reliance  upon  the  determination  of  significance  of  item  fit  indices 
to  become  a  routine  part  of  test  development  and  evaluation.  This  study  was 
designed  to  test  a  possible  method  for  adjusting  fit  statistic  values  and  to 
evaluate  the  utility  of  a  method  for  selecting  a  more  appropriate  reference 
distribution  for  testing  the  significance  of  item  misfit. 


CHAPTER  II 
REVIEW  OF  RELATED  LITERATURE 

Chi-Square Tests

Most  item  fit  indices  are  believed  to  have  distributions  in  the  chi-square 
family. Chi-square tests have a long history of use in statistics, going back to a
publication in 1900 by Karl Pearson (Cochran, 1952). Use of chi-square tests,
as  with  all  statistical  tests,  requires  data  that  conform  to  the  assumptions  of  such 
tests.  If  the  data  do  not  conform,  it  is  possible  that  the  use  of  a  chi-square 
distribution  as  the  reference  distribution  will  be  compromised.  The  assumptions 
for  use  of  a  Pearson  chi-square  test  are  as  follows:  each  observation  falls  into 
exactly  one  cell  of  the  contingency  table;  the  outcomes  in  the  sample  are 
independent;  and,  the  sample  is  large  (Yen,  1981).  It  has  also  been 
recommended  that  there  be  a  minimum  expected  frequency  in  each  cell  of  the 
contingency table, with recommendations varying from 1 to 5 to 10, depending
on the author. Cochran (1952) noted that such minimum values seem to have
been "arbitrarily chosen" (p. 328), but indicated that a single interval with an
expected frequency as low as 0.5 or two intervals with expected frequencies as low as 1
caused little disturbance at the 5% level, or at the 1% level if the degrees of
freedom of the test were greater than 6.

Problems have been associated with the use and interpretation of chi-square
tests in general, including their sensitivity to sample size and
questionable  validity  when  the  expected  value  in  any  cell  in  the  contingency 
table  has  a  frequency  less  than  1.00  (Hambleton  &  Swaminathan,  1985).  Ironson 
(1982)  stated  that  "most  of  the  difficulty  in  setting  intervals  stems  from  the 
necessity  of  having  an  adequate  number  of  observations"  within  each  interval 
(p.  123).  Also,  in  continuous  data  that  must  be  grouped  into  classes,  the  number, 
size,  and  division  points  chosen  by  the  investigator  will  affect  the  sensitivity  of 
the  test  (Ironson,  1982;  Reise,  1990).  When  applying  chi-square  tests  to  item 
response  and  test  score  data,  Scheuneman  (1979)  noted  that  the  number  of 
ability  intervals  that  can  be  created  in  a  particular  data  set  is  dependent  upon 
factors  such  as  item  difficulty,  test  length,  sample  size,  relative  skew  of  the 
distribution,  and  the  minimum  number  of  observations  allowed  per  interval.  All  of 
these  issues  must  be  considered  as  potential  problems  for  item  fit  indices  that 
are  based  on  chi-square  tests. 


Item  Fit  Indices 

The  misfit  of  test  items  to  the  proposed  theoretical  model  will  detract  from 
the  validity  of  IRT-based  measurements  and  thus  the  validity  of  the  resulting 
analyses  and  decisions  (McKinley  &  Mills,  1985).  The  advantages  of  using  an 
IRT-based  model  accrue  only  when  the  assumptions  of  the  model  are  met,  and 
item  fit  indices  are  intended  to  assist  users  in  determining  whether  the  data  do, 
to  an  acceptable  extent,  meet  the  required  assumptions  (Rogers  &  Hattie,  1987). 
Lack of fit can be caused by inadequate estimation procedures or by violation of
model  assumptions.  As  observed  in  McDonald  and  Mok  (1995),  advice  for 
practitioners  regarding  item  fit  index  choice  and  the  establishment  of  criteria  to 
guide  use  in  applied  settings  will  require  further  work. 

Interest  in  an  effective  methodology  for  assessing  fit  at  the  item  level 
remains  high,  especially  among  developers  of  large-scale  assessments.  Such 
methodology  would  allow  early  detection  and  possible  correction  of  potential 
problems  with  items,  saving  time  and  money  in  the  test  development  process.  In 
the  1977-78  National  Assessment  of  Educational  Progress  (NAEP),  item 
statistics  played  only  a  minor  role  in  the  development  of  the  mathematics  test 
(Hambleton  &  Swaminathan,  1985).  Instead,  content  considerations  as 
established  by  mathematics  specialists  played  the  dominant  role  in  item 
selection.  A  one-parameter  model  was  used  in  the  analysis  of  the  data  from  this 
examination.  When  the  fit  of  particular  items  to  the  one-parameter  model  was 
assessed,  cyclic  variations  and  large  standardized  residuals  were  found.  The  fit 
of  the  three-parameter  model  to  the  same  items  was  also  assessed,  with  very 
different  results.  The  cyclic  pattern  disappeared  and  the  size  of  the  residuals 
was  greatly  reduced,  indicating  that  for  these  particular  items,  the  three- 
parameter  model  demonstrated  a  much  better  fit  than  the  one-parameter  model. 
Possible  reasons  for  the  detected  misfit  include  use  of  a  model  that  does  not 
allow  for  variation  in  the  item  discrimination  or  examinee  guessing  behavior. 

Linn  (1998)  evaluated  exemplar  items  from  the  1992  NAEP  mathematics 
assessment based on their content and statistical properties. He noted that
there  is  an  inherent  contradiction  between  the  classification  of  students  at  the 
proficient  level  and  the  selection  of  certain  items  as  exemplars  of  that 
classification  when  the  majority  of  the  proficient  students  cannot  correctly 
complete  the  proficient  exemplar  items.  This  is  an  example  of  a  situation  in 
which  item  fit  indices  would  have  been  useful  in  locating  a  discrepancy  between 
the  proportion  of  students  expected  to  respond  correctly  to  certain  items  and 
those  observed  to  actually  do  so. 

In  a  more  recent  version  of  NAEP,  the  1994  national  administration,  item 
fit  statistics  from  the  computer  program  PARSCALE  (Muraki  &  Bock,  1996)  were 
used  in  identifying  items  that  need  to  be  examined  for  flaws  (Allen,  Kline,  & 
Zelenak,  1996).  As  the  distributions  for  such  statistics  are  not  well  understood, 
the  item  fit  statistics  were  not  used  for  final  decisions  about  item  fit  to  the  model. 
The  lack  of  statistical  tests  meant  that  the  fit  of  the  observed  data  to  the 
theoretical  model  was  also  assessed  through  examination  of  a  plot  of  the 
empirical  versus  the  theoretical  item  response  function  curves.  However,  this 
process  is  time-consuming  and  relies  on  visual  judgment  as  to  the  goodness  of 
the  data  fit. 

The  NAEP  1996  State  Assessment  Program  in  Science  assessed  the  fit  of 
the  items  to  the  IRT  model  both  during  and  after  item  parameter  estimation 
(Allen,  Swinton,  Isham,  &  Zelenak,  1998).  Again,  item  fit  statistics  were 
calculated  as  indicators  of  misfit,  but  final  decisions  about  item  inclusion  were 
based upon graphical analyses. A "certain degree of subjectivity is involved in
determining  the  degree  of  fit  necessary  to  justify  use  of  the  model"  (p.  177),  but 
the  use  of  "seemingly  more  objective  procedures"  (p.  177)  such  as  the  pseudo 
chi-square  goodness  of  fit  indices  was  not  practical.  It  was  not  clear  how  to 
estimate  the  correct  degrees  of  freedom  and  scaling  factors  needed  to  adjust  the 
item  fit  statistics  from  BILOG/PARSCALE  and  determine  their  significance. 

Wright  and  Panchapakesan  (1969)  developed  an  early  test  of  item  fit  for 
use with the Rasch model. As the one-parameter model was being used, item
misfit could be caused by items whose discrimination values differed from those
of other items, by guessing, and by construction or scoring errors or multidimensional
responses. The item fit index was noted to be only approximately chi-square, so
a  recommendation  was  made  that  items  with  significant  test  values  be  examined 
and  a  decision  made  about  which  to  eliminate  from  the  test.  Once  this  was 
done,  the  items  should  be  re-run  and  fit  statistic  values  should  be  examined 
again.  Subsequent  researchers  (Divgi,  1981;  van  den  Wollenberg,  1982)  found 
that  this  item  fit  statistic  was  not  distributed  as  a  chi-square  and  that  the  degrees 
of  freedom  associated  with  the  test  were  not  as  high  as  assumed.  This  difficulty 
with  the  identification  of  the  correct  distribution  and  degrees  of  freedom  for  the 
test  of  item  fit  is  one  that  has  continued  to  plague  developers  and  users  of  such 
indices.  Divgi  (1981)  chose  to  sort  5,512  examinees  by  score  and  then  partition 
them into groups of size 200, giving 27 groups of full size and one of 112
examinees.  As  is  often  the  case  in  the  item  fit  literature,  this  partition  into  ability 
or  score  groups  was  simply  stated  and  not  explained. 


Bock,  in  a  1972  article,  proposed  one  of  the  most  commonly  cited  item  fit 
indices  as  part  of  a  study  on  the  effect  of  multiple-category  item  scoring  on  the 
estimation  of  item  parameters  and  ability  scores  for  examinees.  A  particular 
partition  of  the  ability  distribution  was  not  specified  in  defining  the  test  of  item  fit, 
only  that  the  examinees  be  "ranked  according  to  their  provisional  estimated 
ability, and the ranking is divided into q fractiles" (p. 41). A representative value
was  used  from  the  interval,  either  the  median  (Bock,  1972)  or  the  centroid  (Bock 
&  Jones,  1968),  to  estimate  the  expected  proportion  of  examinees  in  the  interval. 
Bock  (1972)  noted  that  the  requirement  of  local  independence,  one  of  the 
assumptions  required  for  the  use  of  IRT  models,  was  relaxed  to  the  extent  that 
subjects  within  a  small  neighborhood  of  ability  estimates  are  assumed  to 
respond  independently  to  different  items  for  the  purpose  of  grouping  those  with 
similar  ability  estimates.  However,  the  question  of  exactly  what  size  the 
neighborhood  must  be  to  justify  the  assumption  of  independence  was  left  "for 
later empirical study" (p. 37). An example in the article uses 10 ability groupings
for  557  subjects,  and  item  fit  is  assessed  for  each  of  20  vocabulary  items  using  a 
Pearson  chi-square.  This  use  of  10  ability  intervals  to  assess  item  fit  seems  to 
be  the  first  in  the  literature,  although  not  the  last. 

Yen introduced another fit index, called Q1, in a 1981 paper and
examined its behavior using simulated data. Q1 is a special case of Bock's fit
index, with the number of ability intervals defined to be 10 and the mean theta
value in each interval used to determine the expected number of observations
occurring  in  each  interval.  The  index  was  defined  using  a  10-interval  partition  of 
the  underlying  ability  distribution,  so  that  approximately  equal  numbers  of 
examinees fell into each interval. The contingency table was 10 by 2, with the
two columns indicating whether the examinee passed or failed the item. The test was
defined as having 10 - m degrees of freedom, where m is the number of item
parameters estimated from the data. A comparison of Q1 to several other fit
measures with assumed chi-square distributions was made, with all fit indices
using 10 ability intervals to facilitate the comparison. The fit indices considered
were, with the exception of Q4*, highly correlated. Q4* is a fit measure originally
designed for use with the Rasch model, and it did not appear to have the
expected distribution, as the values were below those expected. The
assumption of 10 - m degrees of freedom was not rejected for the other fit
statistics being considered, but the mean value of Q1 was always greater than
10 - m, perhaps indicating that the nominal degrees of freedom of the test were not
completely accurate. In a later paper assessing the effects of local item
dependence on fit, Yen (1984) again used 10 ability intervals for all fit indices to
facilitate comparisons to the performance of Q1. The intervals were selected so
that the number of examinees in each interval was approximately equal, but Yen
acknowledged that this decision was determined by the observed distribution in
the sample. The results of the study indicated that Q1 is not sensitive to
multidimensionality  and  cannot  be  relied  upon  as  a  complete  indicator  of 
significant  item  misfit. 
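
The construction of a Q1-type index as described above (10 equal-count ability groups, a 10-by-2 table, and an expected proportion taken at the mean theta of each group) can be sketched in Python. This is a minimal illustration and not Yen's implementation: the item parameters, the seed, the use of true thetas in place of ability estimates, and the function names `p_3pl` and `q1_statistic` are all assumptions introduced here.

```python
import math
import random


def p_3pl(theta, a, b, c, D=1.7):
    """Three-parameter logistic item response function."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def q1_statistic(thetas, responses, a, b, c, n_groups=10):
    """Pearson-type fit statistic summed over equal-count ability groups."""
    # Rank examinees by ability, then cut the ranking into n_groups pieces.
    order = sorted(range(len(thetas)), key=lambda k: thetas[k])
    size = len(order) / n_groups
    q1 = 0.0
    for j in range(n_groups):
        members = order[round(j * size):round((j + 1) * size)]
        n_j = len(members)
        observed = sum(responses[k] for k in members) / n_j
        # Expected proportion correct evaluated at the group's mean theta.
        mean_theta = sum(thetas[k] for k in members) / n_j
        expected = p_3pl(mean_theta, a, b, c)
        q1 += n_j * (observed - expected) ** 2 / (expected * (1.0 - expected))
    return q1


random.seed(1)
a, b, c = 1.0, 0.0, 0.2  # hypothetical item parameters
# True thetas stand in for ability estimates in this illustration.
thetas = [random.gauss(0.0, 1.0) for _ in range(1000)]
responses = [1 if random.random() < p_3pl(t, a, b, c) else 0 for t in thetas]
q1 = q1_statistic(thetas, responses, a, b, c)
print(round(q1, 2))
```

With three item parameters treated as estimated, the value would nominally be referred to a chi-square distribution with 10 - 3 = 7 degrees of freedom, which is exactly the reference distribution whose adequacy is in question.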


Stone,  Ankenmann,  Lane,  and  Liu  (1993)  expressed  concern  about  item 
fit  to  a  model  in  terms  of  examinee  misclassification,  especially  in  the  context  of 
performance  assessment.  During  the  partitioning  of  the  underlying  theta  ability 
scale  all  examinees  are  placed  in  an  ability  interval.  If  an  examinee's  theta 
score  estimate  is  inaccurate,  the  examinee  may  be  placed  in  the  incorrect  ability 
interval.  This  possibility  is  especially  likely  on  examinations  consisting  of  few 
items,  a  condition  often  the  case  in  performance  assessments.  Mote  and 
Anderson (1965) found that Type I error rates were inflated and the asymptotic
power  of  the  test  was  reduced  when  classification  errors  were  ignored. 
Corrections  were  offered  under  two  specific  conditions:  (a)  if  the 
misclassification  rate  is  constant  across  classes  or  (b)  if  the  misclassification 
occurs  only  between  adjacent  intervals  at  a  constant  rate.  Unfortunately,  neither 
assumption  is  tenable  in  most  real  data  sets,  as  the  misclassification  rates  are 
likely  to  be  directly  related  to  the  amount  of  error  in  the  theta  estimate,  which 
increases  at  extreme  values  of  theta,  and  inversely  related  to  the  size  of  the 
intervals  chosen. 

Many  indices  of  item  fit  are  constructed  using  point  estimates  of  examinee 
ability,  leading  to  possible  classification  errors.  This  methodology  treats  the 
examinee  ability  estimate  as  precise  and  uses  estimates  of  the  expected 
frequencies  at  each  ability  level.  However,  the  estimates  of  examinee  ability  are 
not  without  error,  and  this  assumption  of  precision  may  not  be  warranted.  As  an 
alternative to this, the imprecision with which examinee ability is estimated can
be considered using the probability that an examinee has a particular ability
level $\theta$. The conditional distribution of $\theta$ given a response pattern $x$ is derived
from a Bayesian procedure and is defined as

$$P(\theta \mid x) = \frac{P(x \mid \theta)\,P(\theta)}{P(x)}, \tag{2-1}$$

where $P(\theta \mid x)$ is the posterior distribution of $\theta$, $P(x \mid \theta)$ is the conditional
probability, or likelihood, of the response pattern $x$, $P(\theta)$ is the prior distribution
of $\theta$, typically $N(0, 1)$, and $P(x)$ is the marginal probability of response pattern $x$
for an examinee of unknown $\theta$ randomly sampled from a population with the
given distribution. This procedure, used by Stone et al. (1993) and Stone,
Mislevy, and Mazzeo (1994), results in a distribution of probabilities across the $\theta$
scale for each examinee.
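
The posterior calculation can be illustrated on a discrete grid of ability values. The Python sketch below assumes a three-parameter logistic model, a standard normal prior, 21 equally spaced quadrature points from -4.0 to 4.0, and three hypothetical items; none of these specific values or function names comes from the studies cited, and the grid-based normalization is only one way of approximating the marginal probability of the response pattern.

```python
import math


def p_3pl(theta, a, b, c, D=1.7):
    """Three-parameter logistic item response function."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def posterior(pattern, items, grid):
    """Posterior probability of each grid point given a response pattern."""
    weights = []
    for theta in grid:
        prior = math.exp(-0.5 * theta * theta)  # unnormalized N(0, 1) density
        likelihood = 1.0
        for x, (a, b, c) in zip(pattern, items):
            p = p_3pl(theta, a, b, c)
            likelihood *= p if x == 1 else (1.0 - p)
        weights.append(likelihood * prior)
    total = sum(weights)  # plays the role of the marginal probability of x
    return [w / total for w in weights]


grid = [-4.0 + 0.4 * q for q in range(21)]  # 21 points, spacing 0.4
items = [(1.0, -1.0, 0.2), (1.2, 0.0, 0.2), (0.8, 1.0, 0.2)]  # hypothetical
post = posterior([1, 1, 0], items, grid)

# Each examinee contributes fractional "pseudocounts" to several grid points.
for theta, prob in zip(grid, post):
    print(f"{theta:5.1f}  {prob:.4f}")
```

Because the probabilities for one examinee are spread over neighboring grid points, counts accumulated at adjacent points share examinees, which is the source of the dependence between ability intervals discussed in this study.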

Results  from  Stone  et  al.  (1993)  indicated  that  a  scaling  factor  for  the  item 
fit  statistics  and  corrected  value  for  the  degrees  of  freedom  were  needed  and 
could  be  estimated  using  the  moments  of  a  sampling  distribution  of  the  item  fit 
statistics.  Stone  et  al.  (1994)  designed  a  simulation  experiment  to  estimate  the 
scaling  factor  and  correct  degrees  of  freedom  using  regression.  They 
manipulated  three  factors:  (a)  length  of  test,  (b)  difficulty  of  items,  and  (c)  the 
interval  width  of  the  ability  partition.  A  one-parameter  model  was  used  in  this 
study,  so  difficulty  was  the  only  item  parameter  to  be  estimated  and  used  in  the 
regression.  The  test  length  variable  was  manipulated  by  defining  item  difficulties 
for five items. For tests of length 10, 20, and 40, the original 5-item set was
reused as necessary to make up the desired length.


Three  interval  widths  were  selected,  translating  into  11,  21,  and  41 
quadrature  points,  although  the  rationale  behind  the  selection  of  those  numbers 
was not given. This variable was expected to be an important one, as the
number and width of the intervals chosen directly affect the number of intervals
over  which  a  single  examinee's  posterior  theta  distribution  would  extend  and 
thus  the  degree  of  dependence  created.  As  wider  intervals  on  the  theta  scale 
are  defined,  fewer  examinees'  posterior  theta  distributions  extend  across  interval 
boundaries.  If  no  examinee's  posterior  theta  distribution  extended  across  an 
interval  boundary,  the  intervals  would  be  independent.  The  number  of  intervals 
chosen  also  directly  affects  the  nominal  degrees  of  freedom  for  the  chi-square 
test. 

The  results  of  the  study  show  that  the  effective  degrees  of  freedom  for  all 
test  length  conditions  ranged  from  2.34  to  5.63,  although  if  theta  were  known 
precisely  the  degrees  of  freedom  of  some  tests  should  be  as  high  as  20.  Interval 
width  appeared  to  play  no  role  in  the  prediction  of  the  effective  degrees  of 
freedom, possibly because, as more and narrower intervals are used,
the  examinee  posterior  encompasses  more  intervals,  negating  any  significant 
effect.  Adjustments  to  the  fit  statistics  appeared  to  be  most  effective  when  the 
interval  widths  were  0.4  or  0.2.  Almost  all  variance  in  the  estimates  of  the 
scaling  factor  and  the  effective  degrees  of  freedom  was  accounted  for  by  the 
average  examinee  posterior  theta  variance. 


The  study  results  offer  methodology  that  seems  fairly  effective,  although 
very complex calculations are required to determine the value of the examinee
posterior variance. A method using data more readily available in typical testing
situations would be preferable if it can be shown to be equally effective in
predicting the scaling factor and effective degrees of freedom. The method was
examined only with data generated with the 1PL model, so the utility in situations
where other models are used was not established. The repeated use of the
same 5 items may have had an unanticipated effect on the estimation of
examinee ability, and thus the value of the examinee posterior theta variance. A
study in which items with unique parameters were used, reflecting a more
common testing situation, might have resulted in somewhat different estimates of
the  examinee  ability  and  altered  the  effectiveness  of  the  predictive  models. 

Procedure for Calculation of Scaling Factor

Stone et al. (1993) used quantile-quantile plots to verify that the sampling
distributions  of  item  fit  statistics  were  members  of  the  chi-square  family,  and  to 
examine  the  congruence  of  the  sampling  distribution  to  the  particular  reference 
distribution  chosen.  A  quantile-quantile  plot  of  the  empirical  distribution  values 
versus  the  theoretical  distribution  values  can  be  examined  as  evidence  of 
conformance  of  an  empirical  distribution  with  a  supposed  theoretical  one.  The 
linearity  of  the  plot  is  used  to  determine  whether  the  empirical  data  and  the 
reference distribution have the same distributional shape (Chambers, Cleveland,
Kleiner,  &  Tukey,  1983).  Large  or  systematic  deviations  from  a  linear  plot 
indicate  lack  of  fit  between  the  reference  distribution  and  the  empirical  data. 
Vertical  shifts  suggest  that  all  values  in  the  empirical  distribution  differ  from 
those  in  the  theoretical  distribution  by  a  constant  value,  and  the  shift  can  be 
removed  by  addition  or  subtraction  of  this  constant  to  the  empirical  values. 
Slopes other than 1.0 indicate empirical distributions that have more or less
variance, or spread, than expected; those with less spread than expected are
described as compressed. If the
quantile-quantile  plot  is  linear,  passes  through  the  origin  and  has  slope  less  than 
one,  the  empirical  distribution  is  compressed  by  a  "scaling  factor"  less  than  one. 
Compression  of  values  can  be  corrected  by  dividing  each  value  in  the 
distribution  by  the  scaling  factor.  After  such  decompression,  the  quantile- 
quantile  plot  of  the  decompressed  data  versus  the  theoretical  data  should  be 
linear,  passing  through  the  origin  with  slope  equal  to  one,  indicating  good 
conformance  of  the  empirical  and  theoretical  distributions. 
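
The link between the slope of a quantile-quantile plot and the scaling factor can be illustrated with a small simulation. The Python sketch below is an illustration under stated assumptions, not the procedure of Stone et al. (1993): the reference distribution is fixed at 4 degrees of freedom (chosen only because the chi-square CDF then has a simple closed form), the scaling factor of 0.6 and the seed are hypothetical, and the line is fitted by least squares through the origin.

```python
import math
import random


def chi2_cdf_4df(x):
    """Closed-form CDF of a chi-square variable with 4 degrees of freedom."""
    return 1.0 - math.exp(-x / 2.0) * (1.0 + x / 2.0)


def chi2_quantile_4df(p, lo=0.0, hi=200.0):
    """Invert the CDF by bisection."""
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_cdf_4df(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def slope_through_origin(x, y):
    """Least-squares slope of y on x for a line forced through the origin."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


random.seed(7)
n, gamma, nu = 2000, 0.6, 4  # hypothetical sample size, scaling factor, df
# gammavariate(shape=nu/2, scale=2) draws a chi-square value with nu df.
empirical = sorted(gamma * random.gammavariate(nu / 2.0, 2.0) for _ in range(n))
theoretical = [chi2_quantile_4df((i + 0.5) / n) for i in range(n)]

slope = slope_through_origin(theoretical, empirical)
decompressed = [g / slope for g in empirical]
slope_after = slope_through_origin(theoretical, decompressed)
print(round(slope, 3), round(slope_after, 3))
```

In this sketch the fitted slope recovers a value near the simulated scaling factor, and dividing by it returns a plot with unit slope by construction. It does not address how the reference degrees of freedom would be chosen, which is the role of the moment-based procedure that follows.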

To  determine  the  scaling  factor  needed  to  remove  the  compression  and 
the  effective  degrees  of  freedom,  the  sample  moments  of  the  empirical  fit 
statistic  distribution  can  be  used.  Specifically,  when  the  empirical  sampling 
distribution  is  shown  to  be  compressed,  it  is  the  product  of  some  constant  less 
than  one  and  the  uncompressed  chi-square  distribution 

$$G^{*2} = \gamma G^{2}, \tag{2-2}$$

where $G^{*2}$ represents the underlying distribution of the item fit statistics,
assumed to be a compressed chi-square distribution, $\gamma$ is the scaling factor, and
$G^{2}$ is a member of the chi-square family of distributions. The expected value of
any chi-square distribution is equal to its degrees of freedom and the variance of
any chi-square distribution is equal to twice the degrees of freedom

$$E[G^{2}] = \nu \tag{2-3}$$

and

$$\mathrm{Var}[G^{2}] = 2\nu, \tag{2-4}$$

where $\nu$ is the degrees of freedom of the chi-square distribution. The expected
value of the compressed distribution would thus be the scaling factor multiplied
by the degrees of freedom of the uncompressed chi-square distribution, and the
expected variance of the compressed distribution is the product of the scaling
factor squared and twice the degrees of freedom of the uncompressed
distribution

\begin{align}
E[G^{*2}] &= E[\gamma G^{2}] \tag{2-5} \\
          &= \gamma E[G^{2}] \tag{2-6} \\
          &= \gamma\nu \tag{2-7}
\end{align}

and

\begin{align}
\mathrm{Var}[G^{*2}] &= \mathrm{Var}[\gamma G^{2}] \tag{2-8} \\
                     &= \gamma^{2}\,\mathrm{Var}[G^{2}] \tag{2-9} \\
                     &= \gamma^{2}(2\nu). \tag{2-10}
\end{align}

The mean and variance of the sampling distribution of the item fit statistics can
be used as estimates of the expected mean and variance of the compressed
distribution

$\mathrm{Mean}[G^{*2}] = \gamma\nu$  (2-11)

and

$\mathrm{Var}[G^{*2}] = 2\nu\gamma^{2}$,  (2-12)

where $G^{*2}$ represents the empirical distribution of the item fit statistics obtained
from the simulated data. This results in a system of two equations in two
unknowns, the scaling factor and the effective degrees of freedom. This system
is solvable and provides estimates for the scaling factor and the decompressed
degrees of freedom of the underlying distribution of the item fit statistic

$\hat{\gamma} = \frac{\mathrm{Var}[G^{*2}]}{2\,\mathrm{Mean}[G^{*2}]}$  (2-13)

and

$\hat{\nu} = \frac{2\,(\mathrm{Mean}[G^{*2}])^{2}}{\mathrm{Var}[G^{*2}]}$  (2-14)

where $\hat{\gamma}$ is the estimated scaling factor and $\hat{\nu}$ is the estimated effective degrees
of freedom. The scaling factor is divided into the empirical fit statistic values to
obtain the values composing the decompressed empirical distribution.
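
The moment-matching step can be illustrated with a short Python sketch. It is not the code used in the study: the function names, the seed, and the simulated "true" values (a scaling factor of 0.5 and 5 degrees of freedom) are assumptions chosen only to show that equations 2-13 and 2-14 recover known parameters from compressed chi-square draws.

```python
import random
import statistics


def estimate_compression(fit_values):
    """Method-of-moments estimates following equations 2-13 and 2-14."""
    mean = statistics.fmean(fit_values)
    var = statistics.variance(fit_values)  # sample variance, n - 1 denominator
    gamma_hat = var / (2.0 * mean)         # estimated scaling factor
    nu_hat = 2.0 * mean ** 2 / var         # estimated effective degrees of freedom
    return gamma_hat, nu_hat


def decompress(fit_values, gamma_hat):
    """Divide each fit statistic value by the scaling factor."""
    return [g / gamma_hat for g in fit_values]


# Hypothetical check: draw compressed chi-square values with known parameters.
rng = random.Random(1998)
true_gamma, true_nu = 0.5, 5.0
# A chi-square variate with nu degrees of freedom is Gamma(shape=nu/2, scale=2).
compressed = [true_gamma * rng.gammavariate(true_nu / 2.0, 2.0)
              for _ in range(20000)]

gamma_hat, nu_hat = estimate_compression(compressed)
decompressed = decompress(compressed, gamma_hat)
print(f"scaling factor ~ {gamma_hat:.3f}, effective df ~ {nu_hat:.3f}")
print(f"mean of decompressed values ~ {statistics.fmean(decompressed):.3f}")
```

By construction, the mean of the decompressed values equals the estimated effective degrees of freedom, which is the property the reference chi-square distribution is chosen to match.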

Once  the  scaling  factor  is  used  to  obtain  the  decompressed  distribution 
values,  the  data  form  a  distribution  that  can  be  compared  to  a  chi-square 
distribution  with  the  effective  degrees  of  freedom.  The  decompressed  fit  statistic 
distribution  should  now  correspond  very  closely  with  the  reference  chi-square 
distribution.  New  quantile-quantile  plots  and  regression  slope  and  intercept 



values  can  be  used  to  compare  the  distributions.  Good  conformance  of  the 
distributions  suggests  that  the  determination  of  significance  of  item  misfit  could 
be  made  with  more  confidence  using  the  reference  distribution  with  the  effective 
degrees  of  freedom. 

Item  fit  statistics  have  not  been  shown  to  be  very  effective  in  the  detection 
of  model-data  misfit  to  date.  However,  there  is  continued  interest  in  their  use  if 
methodology  can  be  developed  to  make  detection  of  misfit  at  the  item  level  a 
feasible  prospect.  An  important  step  in  this  process  would  be  to  more  clearly 
understand  the  characteristics  of  the  sampling  distribution  of  these  statistics. 
The  reasons  behind  the  apparent  compression  of  the  data  in  comparison  to  the 
assumed  chi-square  distribution  and  the  large  difference  between  the  asymptotic 
and  empirical  degrees  of  freedom  require  further  research. 


CHAPTER  III 
PROCEDURES,  METHODS,  AND  RESULTS 
PHASE  1 


Research  Questions 

The following research questions were explored in Phase 1 of the study for the
distribution of the item fit statistics:

Question 1: Are the distributions of the item fit statistics members of the
chi-square family?

Question 2: Are the distributions of the item fit statistics compressed?

Because membership in the chi-square family of distributions was confirmed, and
the anticipated compression was detected and removed, the resulting sampling
distribution was then compared with a particular member of the chi-square family
of distributions.

Question 3: After decompression, do the distributions of the item fit statistics
conform to the chi-square reference distribution with the effective degrees of
freedom?

Methods  and  Procedures  for  Phase  1 
In Phase 1 of the study, the following steps were completed for each item:

1. A quantile-quantile plot of the empirical fit statistic values against the
theoretical chi-square values for the 1000 fit statistic replicates was created
and examined, following the methods described by Chambers et al. (1983).

2. Linear regression analysis was run to determine if the relationship was linear,
estimate the strength of the linear relationship, and determine the slope of
the regression line.

3. For plots which were linear, with slope not equal to 1.0, the results indicated
a need for rescaling of the fit statistic values. Scaling factors were computed
and applied using the procedures described by Stone et al. (1994). The
value for the effective degrees of freedom was also computed.

4. A second quantile-quantile plot was constructed for each item using the
re-scaled fit statistic values against the theoretical chi-square values with the
effective degrees of freedom.

5. The linear regression analysis was repeated using the re-scaled values.

6. A Kolmogorov-Smirnov (K-S) test was conducted for each item to determine if
the discrepancy between the decompressed fit statistic distribution and the
theoretical chi-square reference distribution was statistically significant.
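
Steps 1 and 2 can be sketched in Python as follows. This is an illustration under stated assumptions, not the software used in the study: the plotting positions (i + 0.5)/n, the series-based chi-square CDF, the bisection quantile routine, and the simulated sample (0.5 times a chi-square variate with 5 degrees of freedom) are all choices made here for a self-contained example, and the R² reported is the ordinary rather than the adjusted value.

```python
import math
import random


def chi2_cdf(x, df):
    """Chi-square CDF via the series for the regularized lower incomplete gamma."""
    if x <= 0.0:
        return 0.0
    a, h = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    for n in range(1, 10000):
        term *= h / (a + n)
        total += term
        if term < total * 1e-14:
            break
    return min(1.0, total * math.exp(a * math.log(h) - h - math.lgamma(a)))


def chi2_quantile(p, df):
    """Invert the CDF by bisection."""
    lo, hi = 0.0, max(1.0, float(df))
    while chi2_cdf(hi, df) < p:
        hi *= 2.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if chi2_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def qq_regression(sample, df):
    """Regress ordered sample values on theoretical chi-square quantiles."""
    n = len(sample)
    y = sorted(sample)
    x = [chi2_quantile((i + 0.5) / n, df) for i in range(n)]
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    r_squared = sxy * sxy / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical compressed sample: 1000 replicates of 0.5 * chi-square(5).
rng = random.Random(7)
sample = [0.5 * rng.gammavariate(2.5, 2.0) for _ in range(1000)]
slope, intercept, r_squared = qq_regression(sample, df=5)
print(f"slope ~ {slope:.3f}, intercept ~ {intercept:.3f}, R^2 ~ {r_squared:.4f}")
```

A slope well below 1.0 together with a high R² is the pattern that, in the steps above, signals a compressed chi-square distribution in need of rescaling.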


Results  of  Analysis  for  Phase  1 
An example of the quantile-quantile plots, for item 7 from the 60-item test,
is shown in Figure 3-1. Quantile-quantile plots were examined for all items for
both test length conditions, and all are similar to the example plot. The
regression R² values and estimated slope parameter values for the linear
regression of the fit statistic values on the theoretical chi-square values are
given in Table 3-1 for the 30-item length and in Table 3-2 for the 60-item length.
The high R² values are not conclusive evidence of a linear relationship, as other
types of relationships can produce high linear correlation values. However, for
the purposes of the study, all relationships were deemed sufficiently strong to
assume linearity, and thus that the fit statistic distributions are chi-square in
shape. This result offers some evidence to support an affirmative answer to
research question 1. In each case, the regression slope is noticeably less than
one, indicating that the empirical fit statistic distribution has less variance than
the chi-square reference distribution. This result supports an affirmative answer
to research question 2.

An  example  of  the  quantile-quantile  plots  after  the  data  were 
decompressed,  also  for  item  7  from  the  60-item  test,  is  shown  in  Figure  3-2.  The 
plots  were  examined  for  all  items  for  both  test  length  conditions,  and  all  are 
similar  to  the  example  plot.  The  relationship  between  the  decompressed  fit 
statistic  values  and  the  chi-square  values  from  the  reference  distribution 
appears  linear,  with  slope  approximately  equal  to  1.0  and  intercept 
approximately  equal  to  0.0.  The  decompression  of  the  fit  statistic  values 
resulted  in  sampling  distributions  that  closely  match  the  values  and  the 
distributional  shape  of  the  theoretical  chi-square  distributions  with  the  effective 
degrees  of  freedom  chosen  as  the  reference  distribution. 




Table 3-1: Adjusted R² and regression slope values from distribution
comparison, 30-item length.

Item Number    Adjusted R²    Regression Slope
     1           0.9365           0.1550
     2           0.9377           0.1461
     3           0.9181           0.1515
     4           0.8597           0.0827
     5           0.8924           0.1401
     6           0.9125           0.1129
     7           0.9443           0.1152
     8           0.9418           0.1175
     9           0.9262           0.1174
    10           0.9094           0.0674
    11           0.9069           0.1228
    12           0.9273           0.1182
    13           0.9257           0.1062
    14           0.9039           0.0635
    15           0.9160           0.1158
    16           0.8772           0.1163
    17           0.8891           0.1187
    18           0.8866           0.1112
    19           0.9181           0.1096
    20           0.9218           0.1047
    21           0.9218           0.1354
    22           0.9467           0.1480
    23           0.9121           0.1061
    24           0.9369           0.1172
    25           0.9295           0.1132
    26           0.9110           0.1228
    27           0.9154           0.1043
    28           0.9159           0.1643
    29           0.9048           0.1796
    30           0.8816           0.1016



Table 3-2: Adjusted R² and regression slope values from distribution
comparison, 60-item length.

Item Number    Adjusted R²    Regression Slope
     1           0.9450           0.2286
     2           0.9227           0.2283
     3           0.9593           0.2172
     4           0.9126           0.1554
     5           0.9504           0.2118
     6           0.9230           0.1883
     7           0.9523           0.2164
     8           0.9455           0.1999
     9           0.9586           0.2097
    10           0.9304           0.1923
    11           0.9201           0.2079
    12           0.8514           0.1618
    13           0.9759           0.1955
    14           0.9575           0.2106
    15           0.9659           0.1862
    16           0.9547           0.1503
    17           0.9623           0.1855
    18           0.9611           0.1919
    19           0.9707           0.1856
    20           0.9248           0.1492
    21           0.9535           0.2064
    22           0.9373           0.2009
    23           0.9418           0.1862
    24           0.9415           0.1825
    25           0.9430           0.1754
    26           0.9413           0.1771
    27           0.9656           0.1936
    28           0.9597           0.2083
    29           0.9574           0.1679
    30           0.9524           0.2028
    31           0.9446           0.1781
    32           0.9197           0.1772
    33           0.9215           0.1816
    34           0.9523           0.2149
    35           0.9566           0.2324
    36           0.9284           0.1703
    37           0.9474           0.2183
    38           0.9652           0.1759
    39           0.9295           0.1896
    40           0.9325           0.1891
    41           0.9587           0.1733
    42           0.9368           0.1719
    43           0.9437           0.1402
    44           0.9428           0.1887
    45           0.9516           0.1871
    46           0.9392           0.1707
    47           0.9367           0.1324
    48           0.9609           0.1660
    49           0.9612           0.1797
    50           0.9518           0.1463
    51           0.9470           0.1774
    52           0.9563           0.1722
    53           0.9476           0.1637
    54           0.9548           [illegible]
    55           0.9466           0.1423
    56           0.9388           0.1763
    57           0.9785           0.1606
    58           0.9449           0.1780
    59           0.9676           0.1797
    60           0.9379           0.1709


The  regression  R2  values,  regression  slopes,  and  regression  intercepts 
are  presented  in  Table  3-3  for  the  30-item  test  length.  Similar  values  are 
presented  in  Table  3-4  for  the  60-item  test  length.  In  these  tables,  all 
distribution  comparisons  demonstrate  very  strong  linear  relationships.  The 
regression  slope  values  are  all  near  1.0  and  the  regression  intercepts  are  all 
near  0,  indicating  good  fit  between  the  decompressed  fit  statistic  distribution  and 
the  reference  chi-square  distribution  with  the  effective  degrees  of  freedom.  This 
result  offers  evidence  to  support  an  affirmative  answer  to  research  question  3. 

Kolmogorov-Smirnov  (K-S)  tests  are  designed  to  test  whether  there  is  a 
statistically  significant  discrepancy  between  an  observed  frequency  distribution 
and  a  theoretical  distribution.  K-S  tests  were  conducted  to  determine  if  the 
discrepancies  between  the  decompressed  fit  statistic  distribution  and  the 
theoretical  reference  chi-square  distribution  were  statistically  significant.  A 
statistically  significant  test  result  would  indicate  that  the  decompression 
procedure  used  was  ineffective  and  that  the  decompressed  distribution  of  fit 
statistics  still  did  not  conform  well  to  the  reference  chi-square  distribution  with 
the  effective  degrees  of  freedom.  The  re-scaling  procedure  should  have 
resulted  in  fit  statistic  distributions  that  agreed  with  the  reference  distribution  as 
closely as possible. Test results that were not statistically significant would
provide only negative evidence, that is, a failure to detect a discrepancy rather
than a demonstration of agreement, that the re-scaling procedure was effective
and that the conformance between the reference and fit statistic distributions
was adequate.
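
The K-S comparison can be sketched in Python as below. This is a minimal illustration, not the routine used in the study. The chi-square CDF is computed from a series expansion so that the example needs only the standard library, and the simulated samples are hypothetical. The critical value 1.63/√n is the common large-sample approximation at the .01 level; for n = 1000 it gives 0.05155, which agrees with the critical value reported in Table 3-5, although the significance level used in the study is not stated in this section.

```python
import math
import random


def chi2_cdf(x, df):
    """Chi-square CDF via the series for the regularized lower incomplete gamma."""
    if x <= 0.0:
        return 0.0
    a, h = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    for n in range(1, 10000):
        term *= h / (a + n)
        total += term
        if term < total * 1e-14:
            break
    return min(1.0, total * math.exp(a * math.log(h) - h - math.lgamma(a)))


def ks_statistic(sample, cdf):
    """Largest distance between the empirical CDF and a reference CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # Compare the reference CDF with the empirical step just above and below x.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


n = 1000
critical = 1.63 / math.sqrt(n)  # assumed large-sample critical value, alpha = .01

rng = random.Random(3)
# Hypothetical decompressed sample: draws from chi-square(5) itself.
decompressed = [rng.gammavariate(2.5, 2.0) for _ in range(n)]
# Hypothetical compressed sample: 0.5 * chi-square(5), compared with chi-square(5).
compressed = [0.5 * rng.gammavariate(2.5, 2.0) for _ in range(n)]

d_good = ks_statistic(decompressed, lambda x: chi2_cdf(x, 5))
d_bad = ks_statistic(compressed, lambda x: chi2_cdf(x, 5))
print(f"critical value ~ {critical:.5f}")
print(f"decompressed sample: D = {d_good:.4f}, exceeds critical: {d_good > critical}")
print(f"compressed sample:   D = {d_bad:.4f}, exceeds critical: {d_bad > critical}")
```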




Table 3-3: Adjusted R², regression slope, and regression intercept values from
distribution comparison after data decompression, 30-item length.

Item Number    Adjusted R²    Regression Slope    Regression Intercept
     1           0.9958           0.9990               0.0054
     2           0.9960           0.9991               0.0051
     3           0.9938           0.9981               0.0094
     4           0.9894           0.9965               0.0096
     5           0.9854           0.9940               0.0234
     6           0.9971           0.9999               0.0009
     7           0.9983           1.0003              -0.0005
     8           0.9935           0.9978               0.0128
     9           0.9937           0.9979               0.0112
    10           0.9935           0.9981               0.0086
    11           0.9820           0.9921               0.0370
    12           0.9922           0.9972               0.0149
    13           0.9953           0.9989               0.0059
    14           0.9939           0.9983               0.0072
    15           0.9938           0.9981               0.0088
    16           0.9661           0.9841               0.0679
    17           0.9822           0.9923               0.0353
    18           0.9822           0.9923               0.0327
    19           0.9945           0.9984               0.0080
    20           0.9958           0.9992               0.0041
    21           0.9908           0.9964               0.0197
    22           0.9975           0.9998               0.0020
    23           0.9906           0.9965               0.0167
    24           [illegible]      [illegible]          [illegible]
    25           0.9949           0.9986               0.0079
    26           0.9972           1.0001               0.0005
    27           0.9903           0.9964               0.0159
    28           0.9923           0.9973               0.0140
    29           0.9786           0.9903               0.0545
    30           0.9968           1.0002               0.0001
   Mean          0.9913           0.9969               0.0149



Table 3-4: Adjusted R², regression slope, and regression intercept values from
distribution comparison after data decompression, 60-item length.

Item Number    Adjusted R²    Regression Slope    Regression Intercept
     1           0.9958           0.9988               0.0078
     2           0.9982           0.9951               0.0288
     3           0.9981           0.9999               0.0013
     4           0.9940           0.9984               0.0070
     5           0.9978           1.0000               0.0014
     6           0.9867           0.9944               0.0298
     7           0.9978           0.9998               0.0018
     8           0.9962           0.9992               0.0051
     9           0.9980           0.9998               0.0020
    10           0.9973           0.9998               0.0015
    11           0.9920           0.9971               0.0145
    12           0.9615           0.9820               0.0630
    13           0.9986           1.0000               0.0010
    14           0.9948           0.9981               0.0160
    15           0.9955           0.9985               0.0133
    16           0.9946           0.9982               0.0117
    17           0.9973           0.9994               0.0053
    18           0.9855           0.9985               0.0129
    19           0.9984           1.0000               0.0009
    20           0.9951           0.9987               0.0070
    21           0.9935           0.9977               0.0147
    22           0.9948           0.9984               0.0103
    23           0.9946           0.9982               0.0126
    24           0.9943           0.9980               0.0133
    25           0.9955           0.9987               0.0090
    26           0.9970           0.9995               0.0033
    27           0.9975           0.9995               0.0049
    28           0.9969           0.9993               0.0060
    29           0.9970           0.9993               0.0057
    30           0.9968           0.9994               0.0044
    31           0.9924           0.9971               0.0215
    32           0.9801           0.9910               0.0507
    33           0.9925           0.9973               0.0144
    34           0.9963           0.9990               0.0083
    35           0.9948           0.9982               0.0150
    36           0.9977           1.0002               0.0000
    37           0.9959           0.9988               0.0081
    38           0.9979           0.9998               0.0025
    39           0.9915           0.9967               0.0212
    40           0.9970           0.9997               0.0024
    41           0.9978           0.9997               0.0026
    42           0.9908           0.9963               0.0257
    43           0.9932           0.9976               0.0141
    44           0.9907           0.9962               0.0291
    45           0.9939           0.9977               0.0175
    46           0.9932           0.9975               0.0167
    47           0.9952           0.9986               0.0092
    48           0.9975           0.9996               0.0035
    49           0.9965           0.9991               0.0070
    50           0.9980           1.0000               0.0009
    51           0.9960           0.9989               0.0082
    52           0.9971           0.9994               0.0048
    53           0.9948           0.9983               0.0125
    54           0.9951           0.9984               0.0125
    55           0.9942           0.9981               0.0116
    56           0.9903           0.9959               0.0315
    57           0.9988           1.0002              -0.0006
    58           0.9922           0.9968               0.0256
    59           0.9988           1.0003              -0.0012
    60           0.9752           0.9880               0.1595
   Mean          0.9941           0.9980               0.0142


The  results  of  the  test  are  included  in  Table  3-5.  If  any  of  the  test  values 
exceeded  the  critical  value  in  the  K-S  test,  it  would  indicate  that  the 
decompressed  fit  statistic  distribution  was  significantly  different  from  the 
reference  chi-square  distribution  with  the  effective  degrees  of  freedom.  None  of 
the  results  of  the  K-S  tests  was  significant,  lending  additional  support  for  an 
affirmative  answer  to  research  question  3. 

Table  3-6  contains  the  values  for  the  scaling  factor  and  the  effective 
degrees  of  freedom  for  the  30-item  length  test.  Table  3-7  contains  the  values  for 
the  scaling  factor  and  the  effective  degrees  of  freedom  for  the  60-item  length 
test. 

Summary  of  Findings 
In Phase 1 of the study, three research questions were examined. The
distributions of the fit statistics were found to be members of the chi-square
family through use of quantile-quantile plots and regression analysis, thus
supporting an affirmative answer to research question 1. The regression analysis
also offered support for an affirmative answer to research question 2, in that the
slopes of the regression lines were less than 1.0, indicating that the fit statistic
data distribution was compressed. Sampling distribution means and variances were
computed for the fit statistic distribution for each item, and used to calculate (a)
the scaling factor, needed to remove the data compression, and (b) the effective
degrees of freedom, used to identify the reference distribution from the
chi-square family.




Table 3-5: Kolmogorov-Smirnov Test Results, K-S critical = 0.05155

Item Number,    K-S Test    Item Number,    K-S Test
30-item test    Value       60-item test    Value
     1          0.0193           1          0.0314
     2          0.0205           2          0.0308
     3          0.0243           3          0.0136
     4          0.0350           4          0.0162
     5          0.0382           5          0.0155
     6          0.0219           6          0.0243
     7          0.0097           7          0.0208
     8          0.0206           8          0.0173
     9          0.0267           9          0.0158
    10          0.0318          10          0.0274
    11          0.0243          11          0.0289
    12          0.0217          12          0.0474
    13          0.0240          13          0.0197
    14          0.0353          14          0.0176
    15          0.0294          15          0.0226
    16          0.0400          16          0.0146
    17          0.0437          17          0.0195
    18          0.0456          18          0.0212
    19          0.0304          19          0.0171
    20          0.0200          20          0.0292
    21          0.0320          21          0.0180
    22          0.0173          22          0.0228
    23          0.0256          23          0.0255
    24          0.0175          24          0.0261
    25          0.0229          25          0.0232
    26          0.0204          26          0.0299
    27          0.0202          27          0.0165
    28          0.0305          28          0.0195
    29          0.0343          29          0.0130
    30          0.0143          30          0.0201
     -            -             31          0.0329
     -            -             32          0.0323
     -            -             33          0.0341
     -            -             34          0.0245
     -            -             35          0.0217
     -            -             36          0.0393
     -            -             37          0.0229
     -            -             38          0.0155
     -            -             39          0.0307
     -            -             40          0.0201
     -            -             41          0.0200
     -            -             42          0.0323
     -            -             43          0.0179
     -            -             44          0.0328
     -            -             45          0.0248
     -            -             46          0.0291
     -            -             47          0.0307
     -            -             48          0.0134
     -            -             49          0.0198
     -            -             50          0.0307
     -            -             51          0.0234
     -            -             52          0.0164
     -            -             53          0.0262
     -            -             54          0.0169
     -            -             55          0.0091
     -            -             56          0.0369
     -            -             57          0.0142
     -            -             58          0.0275
     -            -             59          0.0080
     -            -             60          0.0434



Table 3-6: Scaling factor and effective degrees of freedom, 30-item length

Item Number    Scaling Factor    Effective Degrees of Freedom
     1            0.6438                 4.7653
     2            0.6005                 4.8614
     3            0.6504                 4.5481
     4            0.4877                 2.5773
     5            0.6672                 3.8031
     6            0.5418                 3.6643
     7            0.4667                 4.9690
     8            0.4576                 5.3906
     9            0.4768                 5.0391
    10            0.3083                 4.0457
    11            0.5271                 4.6093
    12            0.4793                 5.0510
    13            0.4568                 4.4956
    14            0.2981                 3.8600
    15            0.5094                 4.3416
    16            0.5293                 4.2335
    17            0.5226                 4.4630
    18            0.5083                 4.1536
    19            0.4676                 4.6082
    20            0.4735                 4.0825
    21            0.5362                 5.3219
    22            0.5700                 [illegible]
    23            0.4670                 4.5226
    24            [illegible]            [illegible]
    25            0.4642                 4.9228
    26            0.6006                 3.5319
    27            0.4651                 4.2298
    28            0.6812                 4.8865
    29            0.7049                 5.5234
    30            0.5856                 2.6257

Table 3-7: Scaling factor and effective degrees of freedom, 60-item length

Item Number    Scaling Factor    Effective Degrees of Freedom
     1            0.8397                 6.0340
     2            0.8738                 5.6944
     3            0.7459                 6.8061
     4            0.7341                 3.7797
     5            0.8004                 5.6719
     6            0.7544                 5.1931
     7            0.7793                 6.2331
     8            0.7922                 5.1812
     9            0.7057                 7.0933
    10            0.8177                 4.5732
    11            0.8691                 4.7879
    12            0.8262                 3.4647
    13            0.5728                 9.1898
    14            0.6631                 8.1085
    15            0.5774                 8.2860
    16            0.5449                 6.1306
    17            0.5909                 7.8839
    18            0.6027                 8.1193
    19            0.6021                 7.5335
    20            0.6241                 4.7557
    21            0.7542                 6.0437
    22            0.7517                 5.8657
    23            0.6552                 6.5973
    24            0.6534                 6.3799
    25            0.6348                 6.2324
    26            0.6878                 5.4178
    27            0.6005                 8.2853
    28            0.6974                 7.1582
    29            0.5582                 7.2701
    30            0.7548                 5.8341
    31            0.6061                 7.0375
    32            0.6887                 5.5381
    33            0.7356                 5.0882
    34            0.7149                 7.3030
    35            0.7501                 7.7214
    36            0.7839                 3.9142
    37            0.7798                 6.3641
    38            0.5790                 7.3571
    39            0.6971                 6.1275
    40            0.7936                 4.6865
    41            0.5865                 7.0108
    42            0.6017                 6.7053
    43            0.5350                 5.6012
    44            0.6279                 7.3742
    45            0.6174                 7.4260
    46            0.6116                 6.3861
    47            0.4945                 5.8941
    48            0.5605                 7.0269
    49            0.6115                 6.9168
    50            0.5488                 5.7511
    51            0.6241                 6.5650
    52            0.5949                 6.7396
    53            0.5701                 6.6993
    54            0.5041                 [illegible]
    55            0.5385                 5.6783
    56            0.5817                 7.5307
    57            0.4952                 8.2733
    58            0.5732                 7.8535
    59            0.6196                 6.6895
    60            0.4251                13.2580


The  scaling  factors  were  used  to  remove  the  data  compression,  and  the 
conformance  between  the  decompressed  fit  statistic  distribution  and  the 
theoretical  chi-square  reference  distribution  was  assessed.  New  quantile- 
quantile  plots  and  new  linear  regressions  were  developed,  and  K-S  tests  were 
used  to  offer  further  evidence  of  acceptable  similarity  between  the  distributions, 
all  evidence  supporting  an  affirmative  answer  to  research  question  3. 


CHAPTER  IV 
PROCEDURES,  METHODS  AND  RESULTS 
PHASE  2 


Hypotheses 

The  following  hypotheses  were  evaluated  in  phase  2  of  the  study  for  the 
distribution  of  the  item  fit  statistics: 

H₁1: The item parameters will be a statistically significant predictor of the
scaling factor calculated from the sample moments of the distribution.

H₁2: The item parameters will be a statistically significant predictor of the
effective degrees of freedom calculated from the sample moments of the
distribution.

H₁3: The maximum value of the item information function will be a statistically
significant predictor of the effective degrees of freedom calculated from the
sample moments of the distribution when included in the multiple regression
model having the item parameters as independent variables.

H₁4: The variance of a sample of item information values will be a statistically
significant predictor of the effective degrees of freedom calculated from the
sample moments of the distribution when included in the multiple regression
model having the item parameters as independent variables.

H₁5: The maximum value of the item information function will be a statistically
significant predictor of the scaling factor used to decompress the data calculated
from the sample moments of the distribution when included in the multiple
regression model having the item parameters as independent variables.

H₁6: The variance of a sample of item information values will be a statistically
significant predictor of the scaling factor used to decompress the data calculated
from the sample moments of the distribution when included in the multiple
regression model having the item parameters as independent variables.

H₁7: The number of effectively independent intervals will be a statistically
significant predictor of the effective degrees of freedom calculated from the
sample moments of the distribution when included in the multiple regression
model having the item parameters as independent variables.

H₁8: The number of effectively independent intervals will be a statistically
significant predictor of the scaling factor used to decompress the data calculated
from the sample moments of the distribution when included in the multiple
regression model having the item parameters as independent variables.

Methods  and  Procedures  for  Phase  2 
The  values  for  the  effective  degrees  of  freedom  and  scaling  factor  established  in 
phase  1  of  the  study  were  used  as  dependent  variable  values  in  phase  2  of  the 
study.  Item  parameters  for  difficulty,  discrimination,  and  guessing  used  in  this 
study  were  sampled  from  the  1992  National  Assessment  of  Educational 
Progress  (NAEP)  Reading  and  Mathematics  Assessments  (Isham  &  Donoghue, 



1995).  Twelve  items  from  the  2PL  NAEP  items  and  48  items  from  the  3PL  NAEP 
items  were  randomly  sampled  without  replacement  for  the  60-item  condition. 
The 30-item test, Test 1, consists of the first 6 2PL items and the first 24 3PL
items  drawn  for  the  60-item  test,  Test  2  (see  Table  4-1  for  item  parameters). 
Item  parameters  include  values  across  a  wide  range,  although  few  items  with 
extreme  parameters  were  included  in  the  set. 
Item  Information 

The most commonly used measure of item information is the Fisher
information function. It is defined as:

$I_i(\theta) = \frac{[P_i'(\theta)]^{2}}{P_i(\theta)\,[1 - P_i(\theta)]}$  (4-1)

where $i$ indexes items, $P_i(\theta)$ is the function given in equation 1-2, and $P_i'(\theta)$ is the
first derivative of the function (van der Linden & Hambleton, 1997). The
information values provided by individual items can be summed to provide an
estimate of test information. This additivity is an important property of item
information, allowing test information targets to be set and tests to be
constructed to meet them by the inclusion or exclusion of particular items.
Maximum  Values  of  Item  Information 

Item  information  functions  provide  a  maximum  amount  of  information  for 
examinees  of  a  particular  ability  estimate.  This  point  on  the  theta  scale  is 
defined  as: 




Table 4-1: Item Parameters

Item #, Test 1    Item #, Test 2    a parameter    b parameter    c parameter
      1                 1             0.6344         0.5152         0.0000
      2                 2             0.9103         1.2776         0.0000
      3                 3             0.6971         0.9479         0.0000
      4                 4             1.4699        -1.1183         0.0000
      5                 5             1.2140         1.5539         0.0000
      6                 6             1.6323         1.0181         0.0000
      7                 7             1.1742         0.4934         0.0990
      8                 8             0.9584         1.1563         0.1015
      9                 9             1.2272         1.4874         0.1049
     10                10             1.8081        -0.2634         0.1229
     11                11             0.9373         1.8389         0.1250
     12                12             0.8269         1.4903         0.1258
     13                13             1.0437         0.5452         0.1387
     14                14             1.7666        -0.4732         0.1440
     15                15             0.8848         0.0567         0.1456
     16                16             0.8852         0.0055         0.1505
     17                17             1.7953         1.6403         0.1522
     18                18             1.3065         1.3500         0.1544
     19                19             1.9039         1.3259         0.1608
     20                20             1.1189        -0.3338         0.1629
     21                21             0.4909         2.1383         [illegible]
     22                22             0.4440        -0.7308         [illegible]
     23                23             0.9719         1.1181         0.1702
     24                24             0.8473         0.0979         0.1710
     25                25             1.3802         1.7239         0.1712
     26                26             0.9261        -1.0976         0.1748
     27                27             1.1230        -0.4884         0.1839
     28                28             0.4210        -0.7514         0.1844
     29                29             0.6251         4.3601         0.1852
     30                30             0.8628        -2.5607         0.1904
      -                31             0.7669         0.4562         0.0000
      -                32             0.7719        -1.6378         0.0000
      -                33             0.7305         0.6562         0.0000
      -                34             0.7973        -2.0177         0.0000
      -                35             1.4358         0.6058         0.0000
      -                36             1.7718        -0.3507         0.0000
      -                37             0.5067        -0.9875         0.1915
      -                38             0.7193         0.3099         0.1951
      -                39             2.1832         1.3839         0.2028
      -                40             0.8740        -1.4867         0.2073
      -                41             0.8599         0.5531         0.2110
      -                42             0.8791         0.0447         0.2131
      -                43             1.6919         0.0175         0.2133
      -                44             0.5420         0.8224         0.2195
      -                45             0.6042         0.0508         0.2266
      -                46             1.0685         0.9824         0.2395
      -                47             1.8468        -0.0118         0.2423
      -                48             1.0143         0.7219         0.2427
      -                49             0.7947         0.3135         0.2435
      -                50             1.5680        -0.1826         0.2458
      -                51             0.9523         0.0089         0.2564
      -                52             0.3043         0.8143         0.2569
      -                53             1.4925         0.6814         0.2619
      -                54             1.4214         0.1400         0.2642
      -                55             1.7055        -0.4783         0.2772
      -                56             1.0642         1.2761         0.3019
      -                57             1.2325        -0.1777         0.3030
      -                58             1.2416         0.6847         0.3163
      -                59             1.2670        -0.4774         [illegible]
      -                60             [illegible]    [illegible]    [illegible]

Minimum - 30                          0.4210        -2.5607         0.0000
Minimum - 60                          0.3043        -2.5607         0.0000
Maximum - 30                          1.9039         4.3601         0.1904
Maximum - 60                          2.1832         4.3601         0.4106
Mean - 30                             1.0762         0.6108         0.1216
Mean - 60                             1.1027         0.3637         0.1619

$\theta_{\max} = b_i + \frac{1}{D a_i} \ln\left[\frac{1 + \sqrt{1 + 8c_i}}{2}\right]$  (4-2)

where $a_i$, $b_i$, and $c_i$ are the item parameters and $D$ is the scaling constant, -1.7.
Two facets of the item information function will be considered as predictors of
the effective degrees of freedom and scaling factor: the maximum value and the
variance. The maximum information for the item is defined as:

$I_i(\theta_{\max}) = \frac{D^{2} a_i^{2}}{8(1 - c_i)^{2}} \left[1 - 20c_i - 8c_i^{2} + (1 + 8c_i)^{3/2}\right]$  (4-3)

where $a_i$, $b_i$, and $c_i$ are the item parameters and $D$ is a constant defined as -1.7.
Variance of the Item Information Values

An estimate of the variance of the item information values was calculated.
A sample of theta values was generated, centered at the point on the theta scale
at which the maximum information is provided. The theta values were chosen to
encompass a range from 4.0 theta units below the point of maximum information
to 4.0 theta units above that point. The values were selected at intervals of 0.01,
resulting in a sample size of 801. At each of these theta values, the item
information function value was calculated as follows:

$I_i(\theta) =$ [right-hand side illegible in source]  (4-4)

where

[equation illegible in source]  (4-5)

and

$Q_i = 1 - P_i$  (4-6)
From  this  sample  of  item  information  function  values,  the  variance  of  the  item 
information  function  over  this  range  was  estimated. 
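
The quantities in this section can be sketched in Python as below. Because the right-hand sides of equations 4-4 and 4-5 did not survive in this copy, the sketch assumes the standard three-parameter logistic model and its usual information function, with D = 1.7; the sign convention for D in the study may differ. The function names and the use of the sample (n - 1) variance are also assumptions. Item 7 from Table 4-1 serves as the example.

```python
import math
import statistics

D = 1.7  # assumed scaling constant; the source's sign convention is unclear


def p_3pl(theta, a, b, c):
    """Assumed 3PL response function."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def item_information(theta, a, b, c):
    """Standard 3PL item information (assumed form of equation 4-4)."""
    p = p_3pl(theta, a, b, c)
    q = 1.0 - p
    return (D * a) ** 2 * (q / p) * ((p - c) / (1.0 - c)) ** 2


def theta_at_max_information(a, b, c):
    """Point on the theta scale where information peaks (equation 4-2)."""
    return b + math.log((1.0 + math.sqrt(1.0 + 8.0 * c)) / 2.0) / (D * a)


def max_information(a, b, c):
    """Maximum item information (equation 4-3)."""
    return ((D * a) ** 2 / (8.0 * (1.0 - c) ** 2)
            * (1.0 - 20.0 * c - 8.0 * c ** 2 + (1.0 + 8.0 * c) ** 1.5))


def information_variance(a, b, c, half_range=4.0, step=0.01):
    """Variance of information over 801 theta values centred on the maximum."""
    center = theta_at_max_information(a, b, c)
    n_points = int(round(2.0 * half_range / step)) + 1
    grid = [center - half_range + k * step for k in range(n_points)]
    values = [item_information(t, a, b, c) for t in grid]
    return statistics.variance(values), grid, values


a, b, c = 1.1742, 0.4934, 0.0990  # item 7 in Table 4-1
theta_max = theta_at_max_information(a, b, c)
var_info, grid, values = information_variance(a, b, c)
print(f"theta at maximum information ~ {theta_max:.4f}")
print(f"maximum information ~ {max_information(a, b, c):.4f}")
print(f"variance of information over {len(grid)} points ~ {var_info:.4f}")
```

Evaluating the information function at the point given by equation 4-2 reproduces the closed-form maximum of equation 4-3, which provides a quick consistency check on the two formulas.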
Effectively  Independent  Intervals 

It  seemed  possible  that  the  effective  degrees  of  freedom  of  the  chi-square 
fit  statistic  would  be  directly  related  to  the  number  of  intervals  that  reasonably 
could  be  judged  to  be  independent  in  the  observed  theta  distribution  developed 
using  pseudocounts.  This  variable,  the  number  of  "effectively  independent 
intervals,"  was  defined  in  this  study  as  described  below. 

In  an  IRT  model,  examinee  theta  values  can  be  estimated  with  a 
maximum  likelihood  estimator  from  the  examinee  item  responses  and  the  item 
parameters.  This  theta  value  has  an  estimable  standard  error  of  measurement, 
which  is  defined  as  the  inverse  of  the  square  root  of  the  test  information  function. 
Under  the  assumption  of  local  independence,  the  test  information  function  is  the 
sum  of  the  individual  item  information  functions,  so  the  formula  for  the  standard 
error of measurement of theta is:

$SE(\theta) = \frac{1}{\sqrt{\sum_i I_i(\theta)}}$  (4-7)

where $i$ indexes the items and $I_i(\theta)$ represents the item information functions.
The standard error of measurement was used to create a confidence interval
around the theta estimate.

The  point  on  the  theta  scale  that  provides  the  maximum  item  information 
was  used  as  the  center  point  of  the  first  interval.  The  standard  error  value  was 
used  to  define  the  upper  and  lower  bounds  of  a  confidence  interval  about  the 
theta  value  as  follows: 

$LB = \theta - z_{\alpha/2}[SE(\theta)]$  (4-8)

and

$UB = \theta + z_{\alpha/2}[SE(\theta)]$  (4-9)

with  the  z  value  appropriate  to  the  confidence  level  being  calculated. 
Confidence  levels  considered  in  this  study  were  50%,  55%,  60%,  65%,  70%, 
75%,  80%,  85%,  90%,  95%,  and  99%.  Once  the  boundaries  of  the  first  interval 
were  established,  an  estimate  of  the  theta  value  to  be  used  as  the  center  of  the 
next  interval  down  the  theta  scale  was  calculated  using  the  above  standard  error 
value  as  follows: 

\theta = LB - z_{\alpha/2}[SE(\theta)].    (4-10)

The  information  provided  by  the  test  at  this  theta  value  and  the  standard  error  at 
this  point  were  calculated  as  above. 
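Equations 4-8 through 4-10 can be illustrated with a short Python sketch. The function names are hypothetical, and the two-sided z value is obtained from the standard normal quantile function, which is an assumption that agrees with the z scores listed for each confidence level in Tables 4-2 and 4-3.

```python
from statistics import NormalDist


def z_for_level(level):
    """Two-sided standard normal z value for a confidence level such as 0.95."""
    return NormalDist().inv_cdf(0.5 + level / 2.0)


def interval_bounds(theta, se, level):
    """Lower and upper bounds from equations 4-8 and 4-9."""
    half_width = z_for_level(level) * se
    return theta - half_width, theta + half_width


def next_center_guess(theta, se, level):
    """First guess at the next center down the scale, equation 4-10."""
    lower, _ = interval_bounds(theta, se, level)
    return lower - z_for_level(level) * se


# Illustrative values only: theta = 0.0 with a standard error of 0.5.
lb, ub = interval_bounds(0.0, 0.5, 0.95)
guess = next_center_guess(0.0, 0.5, 0.95)
```

The guess from equation 4-10 reuses the standard error of the current interval, so it is only a starting point; the standard error at the new center generally differs and the center must then be adjusted.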

The  bounds  for  this  new  interval  were  calculated  and  tested  for  overlap 
with the previous interval. If the interval bounds overlapped, a revised theta
value was chosen and the process repeated until the intervals did not overlap.
The endpoints of the intervals were also required to be adjacent. Adjacent
intervals were defined for the purpose of this study as being within 0.001 of
each other without overlapping. Estimation of the center point of the new interval was
adjusted  until  the  endpoints  of  the  confidence  intervals  met  both  the  adjacent 
and  the  non-overlapping  criteria.  When  both  criteria  were  met,  the  central  point 
of  the  new  interval  was  accepted  as  the  next  effectively  independent  theta  value 
in  the  distribution. 

Each  time  an  interval  width  was  established,  another  theta  value  was 
chosen  and  a  new  interval  width  estimated.  This  process  continued  in  both 
directions  from  the  initial  theta  value  until  the  upper  and  lower  bounds  of  the 
simulated theta range were reached, and the number of intervals needed to
span the entire theta range of interest was totaled. If the interval at the
top  or  bottom  of  the  distribution  was  found  to  extend  beyond  the  bound,  the 
proportion  of  the  interval  that  fell  within  the  bound  was  counted  toward  the 
number  of  effectively  independent  intervals.  Since  chi-square  tests  do  not 
require  that  the  degrees  of  freedom  be  an  integer,  allowing  partial  intervals  to 
contribute  to  the  total  did  not  cause  the  total  number  of  effectively  independent 
intervals  to  be  an  impossible  value  for  the  degrees  of  freedom.  As  shown  in 
Tables  3-5  and  3-6,  none  of  the  effective  degrees  of  freedom  values  estimated 
for  this  study  in  Phase  1  was  an  integer. 
Modifications  to  Procedure  for  Extreme  Theta  Ranges 

The  procedure  used  for  counting  the  number  of  effectively  independent 
intervals  required  some  modification  when  the  bounds  of  the  theta  range,  -4.0 
and 4.0, were approached. For some items, the end interval boundaries met the
criteria of non-overlapping and adjacency, but either the upper interval boundary
was  greater  than  4.0  or  the  lower  interval  boundary  was  less  than  -4.0.  In  this 
case,  the  proportion  of  the  end  interval  that  lay  within  the  bounds  was  included 
in  the  count.  However,  the  procedure  used  to  determine  the  boundaries  of  the 
effectively  independent  intervals  proved  unusable  on  many  items  for  one  or  both 
of  the  two  most  extreme  intervals,  one  at  the  upper  end  and  one  at  the  lower  end 
of  the  theta  range. 

When  this  occurred,  a  set  of  rules  was  developed  so  that  the  remaining 
portion  of  the  theta  range,  the  distance  from  the  previous  cell  boundary  to  -4.0  or 
to  4.0,  was  included  as  part  of  the  interval  count.  The  rules  used  were  as 
follows: 

1. Try theta values within the predetermined range as possible central points of
the  new  interval  to  verify  that  the  previous  and  new  interval  boundaries 
overlap  throughout. 

2.  Set  the  center  point  of  the  end  interval  starting  from  the  boundary  of  the 
previous  interval  and  moving  toward  the  endpoint  of  the  range  a  distance 
equal  to  one-half  of  the  interval  width  established  in  the  previous  interval. 

3.  Calculate  the  test  information  and  the  standard  error  of  measurement  using 
this  center  point  of  the  end  interval.  This  standard  error  value  will  be  used  to 
determine  the  interval  width  of  the  end  interval  using  the  appropriate  z-score, 
as in equations 4-8 and 4-9.

4. Determine the portion of the range left unaccounted for by measuring the
distance from the previous interval boundary to 4.0 or -4.0. Divide this
distance by the interval width established in step 3. This value is the
proportion of the end interval that falls into the established range.

By following this procedure, all interval counts for all items and all confidence
levels  were  performed  over  the  same  theta  range,  -4.0  to  4.0. 
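Rules 2 through 4 above amount to a short calculation, sketched here in Python. The function name and arguments are hypothetical, rule 1 (confirming that every candidate center overlaps) is assumed to have been checked already, and `se` stands for any function returning the standard error of measurement at a given theta.

```python
def end_interval_proportion(se, z, prev_edge, prev_width, limit):
    """Proportion of the end interval lying inside the theta range.

    prev_edge  -- boundary of the previous interval nearest the range limit
    prev_width -- full width of the previous interval
    limit      -- end of the theta range (4.0 or -4.0 in this study)
    """
    direction = 1.0 if limit > prev_edge else -1.0
    # Rule 2: center one-half of the previous width beyond the previous edge.
    center = prev_edge + direction * prev_width / 2.0
    # Rule 3: interval width from the standard error at that center.
    width = 2.0 * z * se(center)
    # Rule 4: remaining distance to the limit, as a share of that width.
    remaining = abs(limit - prev_edge)
    return remaining / width


# Illustrative values only: a constant standard error of 1.0 and z = 1.0.
share = end_interval_proportion(lambda theta: 1.0, 1.0, 3.0, 1.5, 4.0)
```

In this example the remaining distance is 1.0 and the end interval is 2.0 wide, so one-half of an interval is added to the count.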

The  total  effectively  independent  interval  counts  for  each  item  on  the  30- 
item  test  for  each  confidence  level  are  presented  in  Table  4-2.  The  total 
effectively  independent  interval  counts  for  each  item  on  the  60-item  test  for  each 
confidence  level  are  presented  in  Table  4-3.  For  most  items,  the  interval  counts 
decreased  as  the  confidence  level,  and  thus  the  width  of  the  interval,  increased. 
There  was  a  strong  negative  correlation  between  the  confidence  level  and 
interval  counts,  averaging  -0.98865  over  30  items  and  -0.99292  over  60  items. 
These  correlation  values  appear  in  the  last  column  of  Table  4-2  and  of  Table 
4-3.  The  strength  of  these  values  indicates  that  the  interval  counts  could  be 
accurately  predicted  for  confidence  levels  other  than  those  measured  directly  for 
this  study. 
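As an illustration of the correlations in the last column of Table 4-2, the following Python sketch computes the Pearson correlation between the confidence level and the interval counts for Item 1 of the 30-item test. The data are the Item 1 row of Table 4-2; the helper function is written out by hand so that the example has no dependencies.

```python
import math

levels = [0.50, 0.55, 0.60, 0.65, 0.70, 0.75, 0.80, 0.85, 0.90, 0.95, 0.99]
counts_item1 = [11.49, 10.10, 9.55, 8.35, 7.37, 6.93, 6.23, 6.11, 5.45, 4.09, 3.77]


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


r_item1 = pearson_r(levels, counts_item1)
```

The result should come out close to the -0.989 reported for Item 1, which suggests the tabled correlations were computed against the confidence level itself rather than against the z score.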

Estimating  Population  Validity  and  Cross-Validity  Coefficients 

Multiple  regression  models  using  either  the  scaling  factor  or  the  effective 
degrees  of  freedom  as  the  dependent  variable  were  evaluated  for  effectiveness 
of  prediction.  The  independent  variables  for  the  models  included  the  item 
parameters, the interval counts at all confidence levels, the item information
function maximum value, and the item information function sample variance. In
multiple regression, the sum of squared differences between the actual and
predicted values is minimized. This procedure can capitalize on properties
unique to the sample being used to develop the model, and the resulting squared
multiple correlation (R^2) value may overestimate the value of the squared
multiple correlation in the
population  (population  validity)  or  in  other  samples  drawn  from  the  same 
population  (population  cross-validity). 

Estimates of the population validity, denoted \hat{R}^2, and the population
cross-validity, denoted R^2_{cv}, were provided for all multiple regression models
evaluated. The value presented as an estimate of the population validity was
calculated from a formula given by Ezekiel (1930, as cited in Raju, Bilgic,
Edwards, & Fleer, 1997), specifically:

\hat{R}^2 = 1 - \frac{N-1}{N-n-1}(1 - R^2)    (4-11)

where R^2 is the squared multiple correlation, N is the sample size, and n is the
number of predictors. The estimate of the population cross-validity was
calculated using a procedure developed by Browne (1975, as cited in Crocker &
Algina, 1986), specifically:

R^2_{cv} = \frac{(N-n-3)R_c^4 + R_c^2}{(N-2n-2)R_c^2 + n}    (4-12)

where

R_c^2 = R^2 - \frac{n(1 - R^2)}{N-n-1}    (4-13)


Table 4-2: Effectively Independent Interval Counts, 30-item test length

CI Level   0.50   0.55   0.60   0.65   0.70   0.75   0.80   0.85   0.90   0.95   0.99    Corr
Z score    0.67   0.76   0.84   0.93   1.04   1.15   1.28   1.44   1.65   1.96   2.58

Item 1    11.49  10.10   9.55   8.35   7.37   6.93   6.23   6.11   5.45   4.09   3.77  -0.989
Item 2    11.77  10.19   9.10   8.67   7.33   6.91   6.22   5.99   4.73   5.02   3.35  -0.983
Item 3    11.01  10.13   9.46   8.09   7.48   6.69   6.38   5.82   4.66   4.39   3.18  -0.994
Item 4    11.68  10.09   9.14   8.28   7.90   6.91   6.26   5.80   4.72   4.08   3.14  -0.994
Item 5    11.67  10.37   9.10   8.25   7.49   6.68   6.29   5.34   5.47   4.08   3.35  -0.988
Item 6    11.54  10.10   9.52   8.25   7.87   6.74   6.60   5.81   4.67   4.43   3.24  -0.993
Item 7    11.69  10.17   9.43   8.22   7.87   6.82   5.67   5.94   4.81   4.08   3.36  -0.990
Item 8    11.53  10.33   9.60   8.25   7.89   6.79   5.78   5.82   4.68   4.44   3.27  -0.992
Item 9    11.28  10.14   9.07   8.20   7.42   6.96   6.22   5.35   4.79   5.02   3.37  -0.987
Item 10   11.75  10.10   9.15   8.71   7.46   6.77   5.67   6.10   5.54   4.29   3.33  -0.982
Item 11   11.59  10.18   9.11   8.35   7.30   6.79   6.54   5.18   5.59   4.09   3.34  -0.986
Item 12   11.54  10.11   9.43   8.29   7.45   6.99   6.23   6.15   4.81   5.02   3.37  -0.985
Item 13   11.73  10.15   9.45   8.23   7.89   6.84   5.68   5.99   4.79   4.08   3.36  -0.990
Item 14   11.57  10.16   9.54   8.33   7.43   6.68   5.77   5.94   5.45   4.19   3.28  -0.986
Item 15   11.70  10.14   9.54   8.27   7.33   6.98   6.70   5.14   4.66   5.02   3.36  -0.982
Item 16   11.66  10.16   9.48   8.25   7.35   6.93   6.25   5.22   4.70   4.38   3.36  -0.992
Item 17   11.53  10.11   9.43   8.28   7.45   7.02   6.23   6.17   4.82   5.02   3.36  -0.985
Item 18   11.67  10.43   9.14   8.32   7.39   6.89   5.67   5.88   4.70   5.02   3.33  -0.980
Item 19   11.53  10.14   9.60   8.25   7.90   6.79   5.79   5.26   4.68   4.44   3.28  -0.992
Item 20   11.92  10.10   9.16   8.69   7.45   6.75   5.67   6.08   5.52   4.28   3.32  -0.980
Item 21   11.94  10.41   9.61   8.34   7.45   6.95   6.22   5.90   4.67   4.26   3.34  -0.991
Item 22   11.54  10.49   9.43   8.25   7.32   6.78   6.51   5.84   4.77   4.13   3.23  -0.992
Item 23   11.58  10.10   9.48   8.22   7.51   6.71   6.54   5.81   4.66   4.41   3.22  -0.991
Item 24   11.73  10.11   9.58   8.30   7.33   7.02   6.30   5.12   4.67   5.01   3.35  -0.983
Item 25   11.62  10.17   9.48   8.23   7.46   6.70   6.28   5.33   5.46   4.08   3.36  -0.989
Item 26   11.90  10.13   9.17   8.75   7.86   6.95   6.24   5.11   4.70   4.08   3.12  -0.994
Item 27   11.56  10.16   9.53   8.32   7.42   6.68   5.76   5.93   5.46   4.19   3.27  -0.986
Item 28   11.55  10.47   9.42   8.24   7.33   6.79   6.48   5.83   4.81   4.13   3.23  -0.992
Item 29   11.46  10.30   9.31   8.45   7.60   6.46   5.74   5.16   4.54   3.71   2.22  -0.996
Item 30   11.55  10.16   9.17   8.27   7.40   6.78   5.78   5.15   4.50   3.68   2.49  -0.996


Table 4-3: Effectively Independent Interval Counts, 60-item test length

CI Level   0.50   0.55   0.60   0.65   0.70   0.75   0.80   0.85   0.90   0.95   0.99    Corr
Z score    0.67   0.76   0.84   0.93   1.04   1.15   1.28   1.44   1.65   1.96   2.58

Item 1    15.98  14.11  12.59  11.42  10.40   9.19   8.55   7.28   6.83   5.28   4.44  -0.994
Item 2    15.83  14.04  12.71  11.41  11.09   9.46   8.23   8.15   6.85   5.30   4.34  -0.994
Item 3    15.86  14.05  12.68  11.45  10.39   9.19   8.11   7.33   6.32   5.87   4.68  -0.993
Item 4    15.95  13.95  12.57  11.44  10.09   9.13   8.23   7.32   6.83   5.85   4.35  -0.991
Item 5    15.96  13.96  12.53  11.47  10.20   9.11   8.54   7.29   6.33   5.88   4.39  -0.993
Item 6    15.94  13.95  12.67  11.42  10.45   9.17   8.54   7.31   6.14   5.91   4.44  -0.993
Item 7    15.86  13.94  12.74  11.47  10.09   9.20   8.56   7.34   6.84   6.19   4.36  -0.991
Item 8    15.96  13.93  12.70  11.69  10.22   9.50   8.55   7.29   6.85   5.94   4.52  -0.993
Item 9    15.99  13.93  12.67  11.43  10.39   9.22   8.20   7.70   6.92   5.28   4.33  -0.993
Item 10   15.84  14.12  12.67  11.44  10.38   9.16   8.28   7.69   6.26   5.32   4.54  -0.995
Item 11   15.87  13.92  12.66  11.43  10.09   9.43   8.61   7.28   6.20   5.85   4.47  -0.994
Item 12   15.75  13.93  12.68  11.42  10.40   9.22   8.16   7.69   6.42   5.28   4.34  -0.996
Item 13   15.84  13.92  12.79  11.43  10.09   9.22   8.54   7.31   6.83   5.31   4.38  -0.994
Item 14   16.01  13.95  12.49  11.43  10.19   9.20   8.54   7.18   7.08   5.98   4.47  -0.989
Item 15   15.98  13.97  12.66  11.42  10.09   9.21   8.54   7.30   6.38   5.86   5.50  -0.988
Item 16   15.95  14.42  12.67  11.70  10.13   9.19   8.54   7.28   6.86   5.87   5.38  -0.989
Item 17   15.76  14.07  12.69  11.43  10.41   9.21   8.15   7.69   6.41   5.28   4.34  -0.996
Item 18   15.95  14.14  12.50  11.46  10.18   9.43   8.26   7.29   6.82   5.37   3.40  -0.992
Item 19   15.96  13.93  12.71  11.46  10.23   9.10   8.55   7.28   6.85   5.95   5.40  -0.989
Item 20   15.85  14.10  12.68  11.38  10.10   9.19   8.24   7.69   6.32   5.33   4.53  -0.994
Item 21   15.83  13.94  12.72  11.44  10.21   9.18   8.13   7.70   6.84   5.97   4.58  -0.992
Item 22   15.83  13.94  12.71  11.41  10.21   9.45   8.56   7.29   6.39   5.90   4.42  -0.994
Item 23   15.83  14.00  12.67  11.43  10.42   9.18   8.54   7.35   6.24   5.89   5.38  -0.991
Item 24   15.74  14.05  12.68  11.49  10.09   9.19   8.55   7.25   6.45   5.85   4.52  -0.994
Item 25   15.84  13.86  12.80  11.43  10.22   9.14   8.54   7.73   6.33   5.89   4.38  -0.993
Item 26   15.85  14.06  12.79  11.71  10.45   9.19   8.28   7.71   6.82   5.86   4.34  -0.994
Item 27   16.00  13.96  12.48  11.43  10.19   9.19   8.55   7.19   6.33   5.97   4.47  -0.991
Item 28   15.83  13.88  12.69  11.41  10.17   9.44   8.11   7.28   6.41   5.89   4.42  -0.993
Item 29   15.87  13.99  12.55  11.28  10.16   9.12   8.14   7.16   6.07   4.73   3.24  -0.996
Item 30   15.96  13.94  12.68  11.41  10.19   9.20   8.26   7.35   6.45   5.36   4.41  -0.995
Item 31   15.94  14.08  12.44  11.41  10.38   9.19   8.54   7.26   6.83   5.29   4.40  -0.994
Item 32   15.99  13.93  12.68  11.45  10.13   9.46   8.55   7.72   6.26   5.29   4.36  -0.995
Item 33   16.20  14.01  12.67  11.44  10.56   9.08   8.29   7.17   6.40   5.91   4.52  -0.991
Item 34   15.89  14.01  12.80  11.44  10.39   9.08   8.14   7.83   6.33   6.09   4.46  -0.991
Item 35   15.95  14.02  12.68  11.43  10.16   9.08   8.14   7.78   6.87   5.94   4.49  -0.990
Item 36   15.98  14.06  12.73  11.41  10.42   9.45   8.80   7.71   6.85   6.02   5.36  -0.992
Item 37   16.00  13.94  12.58  11.48  10.39   9.49   8.77   7.36   6.84   5.85   4.36  -0.993
Item 38   15.96  14.11  12.74  11.44  10.23   9.43   8.23   7.69   6.34   5.91   4.34  -0.993
Item 39   15.95  13.93  12.68  11.44  10.14   9.14   8.55   7.30   6.87   5.92   4.50  -0.991
Item 40   16.00  13.94  12.68  11.45  10.15   9.46   8.55   7.72   6.91   5.95   4.36  -0.992
Item 41   15.84  13.93  12.79  11.44  10.10   9.23   8.11   7.31   6.83   5.32   4.38  -0.994
Item 42   15.96  14.15  12.67  11.47  10.46   9.21   8.55   7.88   6.87   5.87   5.38  -0.991
Item 43   15.92  13.94  12.45  11.43  10.23   9.43   8.65   7.37   6.83   5.95   5.36  -0.990
Item 44   15.95  14.10  12.73  11.41  10.19   9.43   8.23   7.70   6.38   5.87   4.56  -0.993
Item 45   15.99  13.97  13.20  11.43  10.41   9.21   8.55   7.82   6.39   5.86   4.50  -0.994
Item 46   15.98  14.08  12.80  11.46  10.12   9.46   8.22   7.69   6.34   5.85   4.60  -0.993
Item 47   15.99  14.06  12.74  11.41  10.43   9.44   8.78   7.72   6.85   6.02   4.59  -0.993
Item 48   16.02  14.14  12.76  11.44  10.42   9.16   8.56   7.29   6.83   5.28   4.40  -0.994
Item 49   15.95  14.10  12.79  11.45  10.19   9.44   8.88   7.70   6.33   5.90   4.35  -0.993
Item 50   15.85  14.12  12.67  11.44  10.39   9.57   8.28   7.69   6.91   6.15   4.54  -0.992
Item 51   15.85  14.09  12.75  11.70  10.58   9.48   8.57   7.29   6.83   5.90   4.68  -0.995
Item 52   15.83  13.91  12.71  11.74  10.45   9.51   8.57   7.81   6.84   5.29   4.47  -0.996
Item 53   15.99  14.11  12.76  11.45  10.41   9.23   8.56   7.33   6.84   5.33   4.37  -0.995
Item 54   15.96  14.11  12.70  11.45  10.17   9.12   8.55   7.93   6.85   5.88   5.38  -0.989
Item 55   15.85  13.89  13.13  11.94  10.17   9.44   8.57   7.28   6.41   5.89   4.42  -0.995
Item 56   15.96  14.10  12.69  11.45  10.19   9.12   8.55   7.56   6.86   5.94   4.53  -0.992
Item 57   15.96  13.77  12.67  11.50  10.39   9.52   8.99   7.70   6.90   5.30   4.55  -0.994
Item 58   15.96  14.09  12.80  11.43  10.40   9.23   8.55   7.31   6.83   5.31   4.38  -0.995
Item 59   15.95  14.34  13.08  11.84  10.21   9.46   8.55   7.82   6.36   5.92   4.44  -0.995
Item 60   15.95  14.22  12.69  11.42  10.47   9.49   8.60   7.32   6.87   5.39   4.34  -0.995


and

R_c^4 = (R_c^2)^2 - \frac{2n(1 - R_c^2)^2}{(N-1)(N-n+1)}.    (4-14)

These values are generally smaller than the multiple R^2 values reported by
statistical software, but they give a less biased picture of the likely utility of
the models in the population of interest.
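The Ezekiel and Browne estimates can be sketched in Python as follows. The function names are hypothetical, and the Browne function assumes that its input is the Ezekiel-adjusted value; under that assumption it reproduces the pairs of values reported in Table 4-4 (N = 90, three predictors).

```python
def ezekiel_adjusted_r2(r2, n_obs, n_pred):
    """Ezekiel estimate of population validity, equation 4-11."""
    return 1.0 - (n_obs - 1) / (n_obs - n_pred - 1) * (1.0 - r2)


def browne_cross_validity(r2_adj, n_obs, n_pred):
    """Browne estimate of population cross-validity, equations 4-12 and 4-14.

    r2_adj is assumed to be the adjusted value R_c^2 (equation 4-13)."""
    rho4 = r2_adj ** 2 - (
        2.0 * n_pred * (1.0 - r2_adj) ** 2
        / ((n_obs - 1) * (n_obs - n_pred + 1))
    )
    numerator = (n_obs - n_pred - 3) * rho4 + r2_adj
    denominator = (n_obs - 2 * n_pred - 2) * r2_adj + n_pred
    return numerator / denominator


# Values from Table 4-4: item parameters only, N = 90, n = 3.
cv_scaling = browne_cross_validity(0.1647, 90, 3)
cv_edf = browne_cross_validity(0.1862, 90, 3)
```

Both results round to the bracketed cross-validity values in Table 4-4 (0.1453 and 0.1673), which supports reading the first tabled value as the adjusted estimate rather than the raw sample R^2.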


Results  of  Regression  Analyses 

The  true  item  parameters  are  included  in  all  multiple  regression  models 
as  independent  variables.  They  represent  data  generally  available  to 
practitioners,  and  are  important  descriptors  of  the  individual  items.  Regression 
models  using  only  the  item  parameters  to  predict  the  scaling  factor  and  effective 
degrees of freedom are statistically significant (p < 0.001), but account for a
relatively small proportion of the variance in the data, about 16.5% for the
scaling factor and 18.6% for the effective degrees of freedom. Clearly, variables
that increase the predictive effectiveness of the model are needed. In
predicting the scaling factor, item parameters a and c are statistically significant
(p < 0.05). In predicting the effective degrees of freedom, only item parameter c
is statistically significant (p < 0.05). These results offer partial support for
research  hypotheses  1  and  2.  Complete  results  are  presented  in  Table  4-4. 

Results  of  the  multiple  regression  using  the  item  information  maximum 
value  and  the  item  parameters  as  independent  variables  and  the  effective 
degrees  of  freedom  as  the  dependent  variable  are  presented  in  Table  4-5. 




Table 4-4: Summary of Multiple Regression Analysis for Item Parameters
Predicting Scaling Factor and Effective Degrees of Freedom (N=90)

Dependent     Reg. Coeff. a  Reg. Coeff. b  Reg. Coeff. c  \hat{R}^2
Variable      [Std. Error]   [Std. Error]   [Std. Error]   {R^2_{cv}}
              (p)            (p)            (p)

Scaling       -0.0721        -0.0032        -0.4381        0.1647
Factor        [0.0278]       [0.0099]       [0.1272]       {0.1453}
              (0.0111)       (0.7498)       (0.0009)

Effective     -0.1513        0.1992         7.9079         0.1862
Degrees of    [0.3711]       [0.1318]       [1.6991]       {0.1673}
Freedom       (0.6845)       (0.1345)       (0.0001)

Table 4-5: Summary of Multiple Regression Analysis for Item Parameters and
Information Function Maximum Value and Item Information Variance Predicting
Effective Degrees of Freedom (N=90)

Item          Reg. Coeff. a  Reg. Coeff. b  Reg. Coeff. c  Reg. Coeff.    \hat{R}^2
Information   [Std. Error]   [Std. Error]   [Std. Error]   item info.     {R^2_{cv}}
              (p)            (p)            (p)            [Std. Error]
                                                           (p)

Maximum       1.4662         0.1822         4.0508         -1.6779        0.2065
              [0.9757]       [0.1306]       [2.7321]       [0.9380]       {0.1799}
              (0.1366)       (0.1712)       (0.1419)       (0.0772)

Variance      0.7546         0.1929         6.4039         -4.3920        0.1891
              [0.8746]       [0.1317]       [2.1464]       [3.8413]       {0.1621}
              (0.3907)       (0.1467)       (0.0037)       (0.2561)

The overall model is statistically significant (p < 0.001) in predicting the
dependent variable. However, none of the individual variables is a statistically
significant predictor of the dependent variable at α = 0.05 and the model
accounts for only about 21% of the variance in the data. Again, this model offers
only a small advantage over the model using only the item parameters as
independent variables. These results support rejection of research hypothesis 3.

Results  of  the  multiple  regression  using  the  item  information  function 
sample  variance  and  the  item  parameters  as  independent  variables  and  the 
effective  degrees  of  freedom  as  the  dependent  variable  are  also  presented  in 
Table 4-5. While the overall model is statistically significant (p < 0.001), only the
item parameter c is a statistically significant predictor of the dependent variable
at α = 0.05 and the model accounts for only about 19% of the variance in the
data.  Again,  there  is  little  to  be  gained  by  the  addition  of  the  item  information 
variance  variable  to  the  model,  as  it  offers  very  small  gains  in  predictive 
effectiveness.  These  results  support  rejection  of  research  hypothesis  4. 

Results  of  the  multiple  regression  using  the  item  information  function 
maximum  value  and  the  item  parameters  as  independent  variables  and  the 
scaling  factor  as  the  dependent  variable  are  presented  in  Table  4-6.  While  the 
overall model is statistically significant (p < 0.001), none of the individual
variables is a statistically significant predictor of the dependent variable at
α = 0.05 and the model accounts for only about 16% of the variance in the data.
This model offers only a very small increase in effectiveness in predicting the
scaling factor over the model using only the item parameters. These results
support rejection of research hypothesis 5. Results of the multiple regression
using the item information function sample variance and the item parameters as
independent variables and the scaling factor as the dependent variable are also
presented in Table 4-6. While the overall model is statistically significant (p <
0.001), only the item parameter a is a statistically significant predictor of the
dependent variable at α = 0.05 and the model accounts for only about 18% of the
variance  in  the  data.  This  model  also  offers  only  a  very  small  increase  in 
variance  accounted  for  in  predicting  the  scaling  factor  over  the  model  using  only 
the  item  parameters.  These  results  support  rejection  of  research  hypothesis  6. 


Table 4-6: Summary of Multiple Regression Analysis for Item Parameters and
Information Function Maximum Value and Item Information Variance Predicting
Scaling Factor (N=90)

Item          Reg. Coeff. a  Reg. Coeff. b  Reg. Coeff. c  Reg. Coeff.    \hat{R}^2
Information   [Std. Error]   [Std. Error]   [Std. Error]   item info.     {R^2_{cv}}
              (p)            (p)            (p)            [Std. Error]
                                                           (p)

Maximum       -0.1018        -0.0028        -0.3672        0.0308         0.1567
              [0.0743]       [0.0010]       [0.2082]       [0.0715]       {0.1289}
              (0.1745)       (0.7783)       (0.0813)       (0.6676)

Variance      -0.1698        -0.0025        -0.2758        0.4737         0.1814
              [0.0650]       [0.0098]       [0.1594]       [0.2853]       {0.1542}
              (0.0106)       (0.8000)       (0.0872)       (0.1005)

Results of the multiple regressions using the interval counts and the item
parameters as independent variables and the effective degrees of freedom as the
dependent variable are presented in Table 4-7. In all cases, the overall model
was significant (p < 0.001), and the item parameter a was not a statistically
significant predictor variable. Item parameters b and c and the interval counts
were statistically significant predictors (p < 0.05). Item parameter b was not a
statistically significant predictor of effective degrees of freedom in the multiple
regression model using only the item parameters as independent variables. The
variance accounted for by the model ranges from a low of 41% at the 55%
confidence level to a high of 49.7% at the 50% confidence level.
Most of the models account for approximately 48% of the variance, with the 55%,
95%, and 99% confidence level interval counts only accounting for about 41% of
the variance in the data. These results support research hypothesis 7.

Results  of  the  multiple  regressions  using  the  interval  counts  and  the  item 
parameters  as  independent  variables  and  the  scaling  factor  as  the  dependent 
variable  are  presented  in  Table  4-8.  In  all  cases,  the  overall  model  was 
significant (p < 0.001), and the item parameter b was not a statistically
significant predictor variable, as in the model using only the item parameters.
Item parameters a and c and the interval counts were statistically significant
predictors (p < 0.05). The variance accounted for by the model ranges
from a low of 41% at the 99% confidence level to a high of 58.5% at
the 65% confidence level. The most effective models appear to be those
that  include  the  interval  counts  between  the  60%  and  80%  confidence  levels. 
These  results  support  research  hypothesis  8. 

The R^2 values for the multiple regression models including the interval
count variables, when viewed as a sequence defined by the confidence levels
from 50% to 99%, do not decrease smoothly. The relationship is depicted
graphically in Figure 4-1 for both scaling factor and effective degrees of
freedom. The jaggedness is particularly pronounced at the 55% confidence
level, possibly indicating a problem with the interval counting rules most visible
at this particular level.


Table 4-7: Summary of Multiple Regression Analysis for Item Parameters and
Effectively Independent Interval Count Predicting Effective Degrees of Freedom
(N=90)

Confidence  Reg. Coeff. a  Reg. Coeff. b  Reg. Coeff. c  Reg. Coeff.    \hat{R}^2
Level       [Std. Error]   [Std. Error]   [Std. Error]   intervals      {R^2_{cv}}
            (p)            (p)            (p)            [Std. Error]
                                                         (p)

50%         -0.2293        0.2799         5.8551         0.4511         0.4969
            [0.2920]       [0.1042]       [1.3647]       [0.0613]       {0.4792}
            (0.4346)       (0.0087)       (0.0001)       (0.0001)

55%         -0.2672        0.2676         5.5474         0.4133         0.4098
            [0.3167]       [0.1129]       [1.5032]       [0.0713]       {0.3893}
            (0.4012)       (0.0200)       (0.0004)       (0.0001)

60%         -0.2053        0.2727         5.6294         0.5812         0.4930
            [0.2930]       [0.1045]       [1.3771]       [0.0798]       {0.4752}
            (0.4854)       (0.0108)       (0.0001)       (0.0001)

65%         -0.2368        0.2812         5.7826         0.6063         0.4834
            [0.2960]       [0.1057]       [1.3864]       [0.0854]       {0.4653}
            (0.4258)       (0.0093)       (0.0001)       (0.0001)

70%         -0.2689        0.2653         6.1311         0.6865         0.4904
            [0.2942]       [0.1047]       [1.3668]       [0.0949]       {0.4725}
            (0.3632)       (0.0131)       (0.0001)       (0.0001)

75%         -0.2068        0.2882         5.6152         0.7772         0.4859
            [0.2951]       [0.1055]       [1.3880]       [0.1087]       {0.4679}
            (0.4854)       (0.0077)       (0.0001)       (0.0001)

80%         -0.2176        0.2798         5.7336         0.8084         0.4895
            [0.2941]       [0.1050]       [1.3790]       [0.1120]       {0.4717}
            (0.4615)       (0.0092)       (0.0001)       (0.0001)

85%         -0.2590        0.2840         6.2767         0.9453         0.4576
            [0.3034]       [0.1084]       [1.4088]       [0.1425]       {0.4387}
            (0.3957)       (0.0104)       (0.0001)       (0.0001)

90%         -0.4330        0.2878         5.8156         1.0694         0.4894
            [0.2966]       [0.1051]       [1.3767]       [0.1482]       {0.4716}
            (0.1480)       (0.0075)       (0.0001)       (0.0001)

95%         -0.3163        0.2822         6.6614         1.0862         0.4115
            [0.3169]       [0.1130]       [1.4607]       [0.1865]       {0.3909}
            (0.3210)       (0.0145)       (0.0001)       (0.0001)

99%         -0.3810        0.3160         6.6877         1.1448         0.4215
            [0.3153]       [0.1128]       [1.4469]       [0.1909]       {0.4013}
            (0.2302)       (0.0063)       (0.0001)       (0.0001)


Table 4-8: Summary of Multiple Regression Analysis for Item Parameters and
Effectively Independent Interval Count Predicting Scaling Factor (N=90)

Confidence  Reg. Coeff. a  Reg. Coeff. b  Reg. Coeff. c  Reg. Coeff.    \hat{R}^2
Level       [Std. Error]   [Std. Error]   [Std. Error]   intervals      {R^2_{cv}}
            (p)            (p)            (p)            [Std. Error]
                                                         (p)

50%         -0.0787        0.0037         -0.6121        0.0383         0.5765
            [0.0198]       [0.0071]       [0.0926]       [0.0042]       {0.5615}
            (0.0001)       (0.6028)       (0.0001)       (0.0001)

55%         -0.0829        0.0032         -0.6569        0.0383         0.5219
            [0.0211]       [0.0075]       [0.1000]       [0.0047]       {0.5051}
            (0.0002)       (0.6719)       (0.0001)       (0.0001)

60%         -0.0767        0.0031         -0.6331        0.0497         0.5791
            [0.0197]       [0.0070]       [0.0927]       [0.0054]       {0.5642}
            (0.0002)       (0.6574)       (0.0001)       (0.0001)

65%         -0.0796        0.0040         -0.6240        0.0530         0.5845
            [0.0196]       [0.0070]       [0.0919]       [0.0057]       {0.5699}
            (0.0001)       (0.5673)       (0.0001)       (0.0001)

70%         -0.0822        0.0025         -0.5904        0.0589         0.5773
            [0.0621]       [0.0070]       [0.0920]       [0.0064]       {0.5624}
            (0.0001)       (0.7221)       (0.0001)       (0.0001)

75%         -0.0769        0.0046         -0.6372        0.0675         0.5817
            [0.0197]       [0.0070]       [0.0925]       [0.0072]       {0.5670}
            (0.0002)       (0.5175)       (0.0001)       (0.0001)

80%         -0.0778        0.0038         -0.6253        0.0696         0.5796
            [0.0197]       [0.0070]       [0.0925]       [0.0075]       {0.5648}
            (0.0002)       (0.5923)       (0.0001)       (0.0001)

85%         -0.0811        0.0039         -0.5735        0.0785         0.5097
            [0.0213]       [0.0076]       [0.0990]       [0.0100]       {0.4924}
            (0.003)        (0.6110)       (0.0001)       (0.0001)

90%         -0.0942        0.0038         -0.6018        [illegible]    0.5057
            [0.0216]       [0.0076]       [0.1001]       [0.0108]       {0.4884}
            (0.0001)       (0.0001)       (0.0001)       (0.0001)

95%         -0.0866        0.0041         -0.5475        0.09535        0.4861
            [0.0219]       [0.0078]       [0.1009]       [0.0129]       {0.4681}
            (0.0002)       (0.5979)       (0.0001)       (0.0001)

99%         -0.0894        0.0056         -0.5298        0.0860         0.4081
            [0.0236]       [0.0084]       [0.1082]       [0.0143]       {0.3875}
            (0.0003)       (0.5068)       (0.0001)       (0.0001)

[Figure 4-1: Scatterplots of Interval Confidence Level versus R2. Two panels,
"Estimated R2 by Confidence Level Predicting Effective Degrees of Freedom" and
"Estimated R2 by Confidence Level Predicting Scaling Factor," each plotting the
population validity and population cross-validity estimates against confidence
level.]

Influential Observations

Influence statistics were examined for all multiple regression models.
Influential observations are ones that appear to have a large influence on the
estimation of the regression parameters. The criteria used to detect influential
observations for this study included a statistic that measured the change in the
determinant of the covariance matrix of the estimates when the ith item was
deleted from the data set, and a scaled measure of the change in the predicted
value for the ith observation that was calculated by deleting the ith observation
from the data set. The second measure is very similar to Cook's D. In
addition, a scaled measure of the change in each parameter estimate was
examined, to locate observations that are influential in estimating a given
parameter. For the multiple regression models predicting either the scaling
factor or the effective degrees of freedom, the models that included (a) only the
item parameters, (b) the item parameters and the item information maximum
value, or (c) the item parameters and the item information variance had no
influential observations contained in the data set. The same was true for the
regression models predicting the scaling factor that included the item
parameters and all levels of the effectively independent interval counts. The
regression models predicting effective degrees of freedom that included the item
parameters and all levels of the effectively independent interval counts had one
observation, item number 60 from the 60-item test, that met both of the overall
test criteria for an influential observation. This observation also met the criterion
for influence on effective degrees of freedom. This item has an unusually high c
parameter, and the c parameter is a heavily weighted predictor of effective
degrees of freedom in the regression. A value of the c parameter this high is
unusual in real data, but this value was observed in the operational NAEP
assessment. This may be the source of the influence of the observation. This
item was removed from the data set and the multiple regressions predicting
effective degrees of freedom using the item parameters and the effectively
independent interval counts were re-run. The results are presented in Table
4-9.

Testing  Significance  of  the  Increases  in  R2 
Because  the  effectiveness  of  the  predictor  variables  is  being  evaluated, 
the  change  in  the  amount  of  variance  accounted  for  by  the  model  when  specific 
predictors  are  included  in  the  model  was  of  interest.  The  basic  model  from 
which  the  change  was  measured  is  the  regression  model  using  only  the  item 
parameters  as  predictors  of  either  effective  degrees  of  freedom  or  the  scaling 
factor.  The  significance  of  the  change  can  be  tested  using  the  statistical  F 
distribution. 
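
As an illustration of this test, the F statistic for an increase in R2 between
nested regression models can be sketched as follows. This is a minimal sketch
rather than the code used in the study; the function name f_change and its
arguments are hypothetical, and the example assumes N = 90 items, three
predictors in the base model (the a, b, and c parameters), and one added
predictor.

```python
def f_change(r2_base, r2_full, n_obs, k_base, k_full):
    """F statistic for the increase in R^2 when predictors are added
    to a nested regression model (hypothetical helper, for illustration)."""
    df_num = k_full - k_base        # number of added predictors
    df_den = n_obs - k_full - 1     # residual df of the larger model
    if df_num <= 0 or df_den <= 0:
        raise ValueError("need k_full > k_base and n_obs > k_full + 1")
    return ((r2_full - r2_base) / df_num) / ((1.0 - r2_full) / df_den)


# R2 values as reported for the models predicting effective degrees of
# freedom; N = 90 and the predictor counts are assumptions of this sketch.
f_info_max = f_change(0.1862, 0.2065, n_obs=90, k_base=3, k_full=4)
f_ci_50 = f_change(0.1862, 0.4969, n_obs=90, k_base=3, k_full=4)
print(round(f_info_max, 2), round(f_ci_50, 2))
```

Under these assumptions the denominator has 85 degrees of freedom, and the two
calls give approximately 2.17 and 52.49, in line with the corresponding
observed F values listed in Table 4-10.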




Table 4-9:  Summary of Multiple Regression Analysis for Item Parameters and
Effectively Independent Interval Count Predicting Effective Degrees of Freedom
After Removal of an Influential Observation (N=89)

Confidence  Reg. Coeff. - a  Reg. Coeff. - b  Reg. Coeff. - c  Reg. Coeff. - intervals  R2v
Level       [Std. Error]     [Std. Error]     [Std. Error]     [Std. Error]             {R2cv}
            (P)              (P)              (P)              (P)

50%         -0.4388          0.2588           3.8564           0.4474                   0.5266
            [0.2516]         [0.0889]         [1.2149]         [0.0523]                 {0.5100}
            (0.0848)         (0.0046)         (0.0021)         (0.0001)

55%         -0.4090          0.2476           3.7931           0.5046                   0.5270
            [0.2514]         [0.0888]         [1.2157]         [0.0589]                 {0.5104}
            (0.1075)         (0.0065)         (0.0025)         (0.0001)

60%         -0.4181          0.2518           3.5930           0.5799                   0.5263
            [0.2516]         [0.0889]         [1.2215]         [0.0678]                 {0.5097}
            (0.1003)         (0.0058)         (0.0001)         (0.0001)

65%         -0.4500          0.2604           3.7395           0.6055                   0.5146
            [0.2548]         [0.0901]         [1.2337]         [0.0727]                 {0.4975}
            (0.0810)         (0.0049)         (0.0032)         (0.0001)

70%         -0.4714          0.2441           4.2027           0.6729                   0.5085
            [0.2441]         [0.0905]         [1.2314]         [0.0819]                 {0.4913}
            (0.0697)         (0.0084)         (0.0010)         (0.0001)

75%         -0.4126          0.2666           3.6693           0.7649                   0.5060
            [0.2569]         [0.0910]         [1.2469]         [0.0937]                 {0.4887}
            (0.1120)         (0.0044)         (0.0042)         (0.0001)

80%         -0.4244          0.2585           3.7707           0.7975                   0.5126
            [0.2585]         [0.0903]         [1.2357]         [0.0962]                 {0.4955}
            (0.1001)         (0.0053)         (0.0030)         (0.0001)

85%         -0.4760          0.2635           4.1919           0.9509                   0.4863
            [0.2623]         [0.0927]         [1.2601]         [0.1219]                 {0.4683}
            (0.0732)         (0.0057)         (0.0013)         (0.0001)

90%         -0.6363          0.2663           3.8567           [illegible]              [illegible]
            [0.2574]         [0.0904]         [1.2345]         [0.1274]                 {0.4947}
            (0.0155)         (0.0042)         (0.0024)         (0.0001)

95%         -0.5469          0.2631           4.4605           1.1222                   0.4416
            [0.2743]         [0.0968]         [1.3099]         [0.1598]                 {0.4221}
            (0.0494)         (0.0080)         (0.0010)         (0.0001)

99%         -0.6058          0.2967           4.5492           1.1661                   0.4465
            [0.2740]         [0.0971]         [1.3022]         [0.1641]                 {0.4272}
            (0.0297)         (0.0030)         (0.0008)         (0.0001)

Note: entries marked [illegible] could not be read in the source copy.


Table  4-10  includes  the  results  of  the  test  of  the  statistical  significance  of  the 
increase  in  variance  accounted  for  by  the  model  for  effective  degrees  of 
freedom, and Table 4-11 includes the same for models predicting scaling factor.
The values in Tables 4-10 and 4-11 are based on the entire data set, including
the  single  influential  observation.  Removal  of  the  influential  observation 
changed  the  specific  test  values,  but  the  pattern  of  significance  in  the  test  results 
remained  the  same. 

The  addition  of  the  item  information  maximum  value  and  the  item 
information  variance  to  the  multiple  regression  model  including  only  the  item 
parameters  as  predictors  of  the  effective  degrees  of  freedom  or  of  the  scaling 
factor  results  in  increases  in  R2  that  are  not  statistically  significant.  The  addition 
of  the  effectively  independent  intervals  count  at  all  confidence  levels  measured 
to  the  multiple  regression  model  including  only  the  item  parameters  as  predictors 
of  the  effective  degrees  of  freedom  or  of  the  scaling  factor  results  in  increases  in 
R2  that  are  statistically  significant. 

These  increases  in  variance  accounted  for  by  the  multiple  regression 
models  indicate  that  the  interval  count  variable  is  potentially  useful  in  predicting 
the  scaling  factor  needed  to  decompress  the  item  fit  statistic  values  and  the 
effective  degrees  of  freedom  for  the  distribution  once  decompressed.  However, 
a  substantial  part  of  the  variance  in  both  dependent  variables  remains 
unaccounted  for  by  the  model,  indicating  that  more  research  is  required  to  locate 
additional  effective  independent  variables. 


Table 4-10  Test of the Significance of Increase in R2 for All Multiple Regression
Models Predicting Effective Degrees of Freedom.

Independent Variables                       R2v       F observed, R2v      F critical
                                            {R2cv}    {F observed, R2cv}   (.05, 1, 86)

Item Parameters                             0.1862
                                            {0.1673}

Item Parameters + Item Info Max             0.2065    2.1745               3.9532
                                            {0.1799}  {1.3054}

Item Parameters + Item Info Var             0.1891    0.3040               3.9532
                                            {0.1621}  {-0.5279}

Item Parameters + 50% CL Interval Count     0.4969    52.4935              3.9532
                                            {0.4792}  {50.9161}

Item Parameters + 55% CL Interval Count     0.4098    32.2026              3.9532
                                            {0.3893}  {30.9039}

Item Parameters + 60% CL Interval Count     0.4930    51.4359              3.9532
                                            {0.4752}  {49.8787}

Item Parameters + 65% CL Interval Count     0.4834    48.9005              3.9532
                                            {0.4653}  {47.3668}

Item Parameters + 70% CL Interval Count     0.4904    50.7398              3.9532
                                            {0.4725}  {49.1867}

Item Parameters + 75% CL Interval Count     0.4859    49.5516              3.9532
                                            {0.4679}  {48.0160}

Item Parameters + 80% CL Interval Count     0.4895    50.5005              3.9532
                                            {0.4717}  {48.9668}

Item Parameters + 85% CL Interval Count     0.4576    42.5313              3.9532
                                            {0.4387}  {41.0890}

Item Parameters + 90% CL Interval Count     0.4894    50.4740              3.9532
                                            {0.4716}  {48.9393}

Item Parameters + 95% CL Interval Count     0.4115    32.5412              3.9532
                                            {0.3909}  {31.2126}

Item Parameters + 99% CL Interval Count     0.4215    34.5730              3.9532
                                            {0.4013}  {33.2281}



Table 4-11  Test of the Statistical Significance of Increase in R2 for All Multiple
Regression Models Predicting Scaling Factor.

Independent Variables                       R2v       F observed, R2v      F critical
                                            {R2cv}    {F observed, R2cv}   (.05, 1, 86)

Item Parameters                             0.1647
                                            {0.1453}

Item Parameters + Item Info Max             0.1567    -0.8064              3.9532
                                            {0.1289}  {-1.5978}

Item Parameters + Item Info Var             0.1814    1.7341               3.9532
                                            {0.1542}  {0.8919}

Item Parameters + 50% CL Interval Count     0.5765    82.6517              3.9532
                                            {0.5615}  {80.6922}

Item Parameters + 55% CL Interval Count     0.5219    63.5055              3.9532
                                            {0.5051}  {61.8016}

Item Parameters + 60% CL Interval Count     0.5791    83.6873              3.9532
                                            {0.5642}  {81.7218}

Item Parameters + 65% CL Interval Count     0.5845    85.8797              3.9532
                                            {0.5699}  {83.9051}

Item Parameters + 70% CL Interval Count     0.5773    82.9690              3.9532
                                            {0.5624}  {81.0203}

Item Parameters + 75% CL Interval Count     0.5817    84.7358              3.9532
                                            {0.5670}  {82.7644}

Item Parameters + 80% CL Interval Count     0.5796    83.8880              3.9532
                                            {0.5648}  {81.9293}

Item Parameters + 85% CL Interval Count     0.5097    59.8103              3.9532
                                            {0.4924}  {58.1378}

Item Parameters + 90% CL Interval Count     0.5057    58.6385              3.9532
                                            {0.4884}  {57.0169}

Item Parameters + 95% CL Interval Count     0.4861    53.1601              3.9532
                                            {0.4681}  {51.5833}

Item Parameters + 99% CL Interval Count     0.4081    34.9535              3.9532
                                            {0.3875}  {34.9535}


Evaluating  the  Regression  Models 

As  a  measure  of  the  effectiveness  of  the  regression  model,  the  predicted 
values  of  the  scaling  factor  and  the  effective  degrees  of  freedom  were  used  to 
decompress  the  data  and  determine  the  predicted  values  for  degrees  of  freedom 
of  the  reference  distribution.  The  distributions  decompressed  with  the  scaling 
factor  and  effective  degrees  of  freedom  derived  from  the  sample  moments  from 
Phase  1  of  the  study  were  used  to  make  a  comparison  of  the  rate  of  occurrence 
of  statistically  significant  values  in  the  distributions.  A  nominal  rate  of  5%  was 
chosen  for  the  comparison.  A  measure  of  agreement  that  corrects  for  the 
chance  agreement  between  two  evaluations  of  the  same  value,  kappa,  was  also 
calculated.  Two  regression  estimates  of  the  effective  degrees  of  freedom  were 
generated,  with  and  without  the  influential  observation  discussed  previously. 
Complete  results  of  the  comparison  for  all  items  are  presented  in  Table  4-12. 
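
The kappa statistic referred to above can be sketched as follows. This is a
minimal sketch of Cohen's kappa for two sets of binary decisions on the same
cases, assuming that each decision records whether a fit statistic value was
judged statistically significant under one of the two reference distributions.
The function name cohen_kappa and the example data are hypothetical and are not
taken from the study.

```python
def cohen_kappa(flags_a, flags_b):
    """Cohen's kappa for two equal-length sequences of binary decisions."""
    if len(flags_a) != len(flags_b) or len(flags_a) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(flags_a)
    # Observed agreement: both significant or both not significant.
    agree = sum(1 for a, b in zip(flags_a, flags_b) if bool(a) == bool(b))
    p_obs = agree / n
    # Agreement expected by chance from the two marginal rates.
    rate_a = sum(1 for a in flags_a if a) / n
    rate_b = sum(1 for b in flags_b if b) / n
    p_exp = rate_a * rate_b + (1.0 - rate_a) * (1.0 - rate_b)
    if p_exp == 1.0:
        # Both sequences are constant and identical; kappa is undefined,
        # and 1.0 is returned here by convention (an assumption).
        return 1.0
    return (p_obs - p_exp) / (1.0 - p_exp)


# Hypothetical decisions for ten replications (1 = significant at 5%).
moment_based = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
regression_based = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(round(cohen_kappa(moment_based, regression_based), 3))
```

In this made-up example the two sets of decisions agree on nine of ten cases,
but because most of that agreement is expected by chance when significant
values are rare, kappa is only about 0.62.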

Although  the  average  rate  of  statistically  significant  values  across  all 
items  for  the  distributions  decompressed  with  the  scaling  factor  and  effective 
degrees  of  freedom  derived  from  the  sample  moments  is  close  to  the  nominal 
rate,  the  values  vary  widely  from  the  nominal  rate  for  individual  items.  The 
regression estimates using all observations have rates with less variance in the
individual item values, but the average rate across all items is farther from the
nominal rate. Removal of the influential observation reduces the variance of the
rates for the regression estimates further and causes the average rate to be
lower than the nominal.


Table 4-12  Rate of Significant Test Values and Level of Agreement

                                          All items            Without Item 60
Item   N      Nominal   Empirical   Regression            Regression
              Rate      Rate        Rate        Kappa     Rate        Kappa

1      2000   5.00      8.85        4.55        0.29      2.75        0.24
2      2000   5.00      1.90        3.60        0.68      1.70        0.42
3      2000   5.00      9.85        4.65        0.23      2.80        0.19
4      2000   5.00      1.20        7.90        0.25      4.65        0.40
5      2000   5.00      11.85       5.00        0.29      3.45        0.24
6      2000   5.00      1.55        6.70        0.36      4.30        0.52
7      2000   5.00      12.95       5.55        0.51      3.50        0.39
8      2000   5.00      1.10        4.20        0.41      2.60        0.59
9      2000   5.00      10.55       4.95        0.20      3.00        0.14
10     2000   5.00      2.10        9.05        0.36      6.40        0.48
11     2000   5.00      6.90        3.20        0.38      2.25        0.32
12     2000   5.00      0.60        3.65        0.28      2.25        0.42
13     2000   5.00      6.35        5.60        0.65      3.70        0.62
14     2000   5.00      0.65        9.45        0.12      6.65        0.17
15     2000   5.00      14.35       4.40        0.33      2.70        0.27
16     2000   5.00      1.65        5.40        0.45      3.70        0.61
17     2000   5.00      10.35       8.15        0.55      5.95        0.50
18     2000   5.00      1.35        5.85        0.36      3.95        0.50
19     2000   5.00      17.60       9.15        0.64      6.50        0.49
20     2000   5.00      1.45        5.85        0.38      3.45        0.58
21     2000   5.00      8.00        2.45        0.45      1.75        0.34
22     2000   5.00      1.35        4.20        0.48      2.65        0.67
23     2000   5.00      11.55       4.65        0.46      3.30        0.40
24     2000   5.00      1.30        5.35        0.38      3.55        0.53
25     2000   5.00      5.65        5.85        0.90      4.10        0.83
26     2000   5.00      0.60        6.15        0.17      3.75        0.27
27     2000   5.00      8.50        6.00        0.72      3.65        0.58
28     2000   5.00      1.95        4.80        0.57      3.35        0.44
29     2000   5.00      10.70       2.20        0.32      1.30        0.20
30     2000   5.00      1.10        9.50        0.19      6.70        0.27
31     1000   5.00      3.90        4.10        0.97      2.40        0.76
32     1000   5.00      1.70        4.40        0.55      2.10        0.89
33     1000   5.00      2.70        3.40        0.88      2.30        0.92
34     1000   5.00      1.70        5.70        0.45      3.40        0.66
35     1000   5.00      1.00        5.30        0.31      3.20        0.47
36     1000   5.00      2.20        8.70        0.38      5.40        0.57
37     1000   5.00      1.60        5.00        0.47      3.60        0.61
38     1000   5.00      3.70        4.90        0.85      3.10        0.91
39     1000   5.00      2.70        10.60       0.38      7.30        0.52
40     1000   5.00      1.60        6.40        0.38      4.10        [illeg.]
41     1000   5.00      3.70        4.60        0.89      3.30        0.94
42     1000   5.00      3.30        4.70        0.82      3.30        1.00
43     1000   5.00      8.00        9.70        0.90      7.10        0.94
44     1000   5.00      2.40        3.00        0.89      1.90        [illeg.]
45     1000   5.00      2.90        4.40        0.79      3.10        0.97
46     1000   5.00      4.40        4.80        0.95      3.20        0.84
47     1000   5.00      7.40        8.60        0.92      6.70        0.95
48     1000   5.00      4.10        4.60        0.94      3.50        0.92
49     1000   5.00      3.00        5.40        0.70      3.00        1.00
50     1000   5.00      6.60        8.80        0.85      6.80        [illeg.]
51     1000   5.00      2.70        4.70        0.72      3.00        0.95
52     1000   5.00      4.10        6.00        0.80      4.10        1.00
53     1000   5.00      4.60        8.80        0.67      5.70        0.89
54     1000   5.00      5.50        7.70        0.82      5.60        0.99
55     1000   5.00      6.30        9.90        0.76      6.80        0.9?
56     1000   5.00      3.20        5.50        0.72      [illeg.]    [illeg.]
57     1000   5.00      3.60        7.20        [illeg.]  [illeg.]    [illeg.]
58     1000   5.00      2.80        6.50        0.59      4.60        0.75
59     1000   5.00      2.50        9.50        0.39      6.80        0.52
60     1000   5.00      2.20        12.70       0.27      10.10       0.33

Max                     17.60       12.70       0.97      10.10       1.00
Min                     0.60        2.20        0.12      1.30        0.14
Mean                    4.67        6.06        0.55      4.08        0.62
Var                     15.40       5.02        0.06      3.13        0.07

Note: entries marked [illeg.] could not be read in the source copy; "?" marks a
single unreadable digit.

CHAPTER  V 
DISCUSSION  AND  CONCLUSIONS 


Summary  of  Procedures  and  Findings 

This  study  was  conducted  in  two  phases.  Phase  1  abstracted  the  item  fit 
statistics,  calculated  using  an  IRT  model  using  pseudocounts,  from  a  large 
simulated  data  set  created  at  ETS.  Since  the  pseudocounts  model  violates  one 
of  the  important  assumptions  underlying  the  use  of  a  chi-square  distribution  as 
the  reference  distribution,  the  properties  of  the  fit  statistic  distribution  could  not 
be assumed and were investigated in phase 1. Previous studies have indicated
that  item  fit  statistic  values  form  a  distribution  that  is  a  member  of  the  chi-square 
family  of  distributions,  although  the  data  are  compressed  and  have  degrees  of 
freedom  considerably  less  than  expected,  a  result  confirmed  in  phase  1  of  this 
study.  The  scaling  factor  needed  to  remove  the  compression  and  the  effective 
degrees  of  freedom  used  to  select  the  reference  distribution  were  calculated 
from  the  sample  moments  of  the  fit  statistic  distributions.  The  values  of  the 
scaling  factor  and  effective  degrees  of  freedom  served  as  dependent  variables 
in  phase  2  of  the  study. 
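
One common way to obtain such values from sample moments is sketched below. It
assumes that the fit statistic X behaves like a scaled chi-square variable,
X = g * chi-square(v), so that its mean is g*v and its variance is 2*g*g*v.
Solving these two equations gives g = variance / (2 * mean) and
v = 2 * mean^2 / variance. The exact estimator and the direction in which the
scaling factor was defined in phase 1 are not restated in this chapter, so the
function names and the convention used here (dividing by g to decompress) are
assumptions of this sketch.

```python
import statistics


def moments_to_scale_df(mean, variance):
    """Method-of-moments scale g and effective df v for X = g * chi2(v)."""
    if mean <= 0 or variance <= 0:
        raise ValueError("mean and variance must be positive")
    scale = variance / (2.0 * mean)          # g
    eff_df = 2.0 * mean ** 2 / variance      # v
    return scale, eff_df


def estimate_from_sample(values):
    """Apply the moment equations to a sample of fit statistic values."""
    mean = statistics.fmean(values)
    variance = statistics.variance(values)   # sample variance, n - 1 divisor
    return moments_to_scale_df(mean, variance)


# Hypothetical moments: mean 6 and variance 6 give g = 0.5 and v = 12,
# that is, values compressed to half the spread of a chi-square with 12 df.
scale, eff_df = moments_to_scale_df(6.0, 6.0)
print(scale, eff_df)
```

Under this convention a decompressed value would be X / g, referred to a
chi-square distribution with v degrees of freedom.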

In phase 2, various independent variables were assessed for their
effectiveness in predicting the scaling factor and effective degrees of freedom.
The use of sample moments of the fit statistic distribution to estimate these
values, as done in phase 1, is not possible in most testing situations. The
variables examined in phase 2 of the study as predictors included the item
parameters, the maximum of the item information function values, the variance of
a sample of item information function values, and the number of effectively
independent intervals for various confidence levels. A previous study (Stone et
al., 1994) found that regression using the mean of the posterior variance of
examinee theta estimates as the predictor of the effective degrees of freedom
and scaling factor resulted in R2 values above 0.90. This variable was not used
in this study. One objective of this study was to use independent variables that
were reasonably mathematically tractable and derivable from information readily
available to practitioners. The posterior variances of examinee theta estimates
are very difficult to calculate and thus do not meet this criterion.

Effectively independent interval counts were performed for various
confidence levels, ranging from 50% to 99%, for items on tests of 30 and 60
items. These intervals were defined by selecting a central theta value for
an interval, beginning with the point that provided the maximum information for
an item, and creating an interval around the central point on the theta scale
equal to the width of the confidence interval. These intervals were calculated so
that two criteria were met: (a) that the interval endpoints not overlap, and (b) that
the interval endpoints be adjacent. Adjacency was defined as points that were
no more than 0.001 units apart on the theta scale. If the criterion of
non-overlapping could not be met, as was the case for several items in the intervals
closest to the bounds of the theta range, a set of rules was developed and
followed to account for the remaining portion in the total interval count. The
bounds of the theta range were set at -4.0 and 4.0, respectively.
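
One possible reading of this counting procedure is sketched below. It is an
illustration only, not the procedure used in the study: the function name
count_intervals and its parameters are hypothetical, the half-width of each
interval is supplied by the caller as a function of theta, and the treatment of
the end intervals (counting the fraction of the interval that lies inside the
theta range) is a placeholder assumption that does not reproduce the study's
own rules.

```python
def count_intervals(center, half_width, lo=-4.0, hi=4.0, step=0.001):
    """Count adjacent, non-overlapping intervals tiled outward from `center`.

    half_width(theta) must return a positive half-width for an interval
    centered at theta. Whole intervals inside [lo, hi] count as 1; an end
    interval cut off by a bound counts as a fraction (assumed rule).
    """
    def walk(edge, bound, direction):
        total = 0.0
        while (bound - edge) * direction > 1e-9:
            # Scan outward in `step` increments for the next center whose
            # inner endpoint reaches the current edge (adjacency criterion).
            k, found = 1, False
            while True:
                c = edge + direction * k * step
                if (bound - c) * direction < 0:
                    break                      # center left the theta range
                inner = c - direction * half_width(c)
                if (inner - edge) * direction >= 0:
                    found = True
                    break
                k += 1
            if not found:
                # No admissible center: count the remainder proportionally.
                total += (bound - edge) * direction / (2.0 * half_width(bound))
                break
            outer = c + direction * half_width(c)
            if (bound - outer) * direction >= 0:
                total += 1.0                   # interval fits entirely
                edge = outer
            else:
                width = (outer - inner) * direction
                total += (bound - edge) * direction / width
                break
        return total

    hw = half_width(center)
    right = walk(center + hw, hi, 1)
    left = walk(center - hw, lo, -1)
    return 1.0 + right + left


# Hypothetical example: a constant half-width of 1.0 around a center of 0.0.
print(count_intervals(0.0, lambda theta: 1.0))
```

With a constant half-width the count reduces to the range divided by the
interval width (here 8 / 2 = 4). In practice the half-width would vary with
theta, for example as a normal quantile times a standard error, which is what
makes the counts differ near the bounds of the range.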

The  interval  counts  for  all  confidence  levels  were  used  as  independent 
variables,  along  with  the  item  parameters,  in  multiple  regression  models  used  to 
predict  the  values  of  a  scaling  factor  used  to  decompress  item  fit  statistic  values 
and  the  effective  degrees  of  freedom  of  the  decompressed  item  fit  statistic 
distribution.  The  dependent  variable  values  used  in  this  study,  the  scaling 
factors  and  effective  degrees  of  freedom,  were  taken  from  phase  1  of  this  study. 

Values  for  the  maximum  item  information  for  each  item  were  calculated 
using formula 4-1. The variance of the item information values for each item was
estimated  by  generating  a  sample  of  theta  values  about  the  point  on  the  theta 
scale  that  provided  the  maximum  item  information.  Item  information  values  were 
calculated  for  each  of  these  theta  values,  and  the  variance  of  this  sample  of  item 
information  values  was  then  calculated.  The  item  information  maximum  values 
and  the  item  information  variance  values  were  each  used,  along  with  the  item 
parameters,  in  multiple  regression  models  used  to  predict  the  values  of  the 
scaling  factor  needed  to  decompress  item  fit  statistic  values  and  the  effective 
degrees  of  freedom  of  the  decompressed  item  fit  statistic  distribution. 
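
These two predictors can be illustrated with the standard three-parameter
logistic (3PL) item information function and the well-known closed form for the
theta value at which it peaks. The sketch below is not a restatement of formula
4-1, which is not reproduced in this chapter. The scaling constant D = 1.7, the
evenly spaced grid of theta values, and all function names are assumptions made
for illustration.

```python
import math
import statistics

D = 1.7  # logistic scaling constant (assumed; some applications use 1.0)


def p_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def info_3pl(theta, a, b, c):
    """3PL item information at theta."""
    p = p_3pl(theta, a, b, c)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


def theta_at_max_info(a, b, c):
    """Theta value at which the 3PL item information is largest."""
    return b + math.log((1.0 + math.sqrt(1.0 + 8.0 * c)) / 2.0) / (D * a)


def info_variance(a, b, c, half_range=1.0, n_points=21):
    """Variance of information values on a grid centered at the maximum.

    The grid (evenly spaced, +/- half_range) is a hypothetical choice.
    """
    center = theta_at_max_info(a, b, c)
    spacing = 2.0 * half_range / (n_points - 1)
    thetas = [center - half_range + i * spacing for i in range(n_points)]
    return statistics.variance(info_3pl(t, a, b, c) for t in thetas)


# Hypothetical item: a = 1.2, b = 0.3, c = 0.2.
t_max = theta_at_max_info(1.2, 0.3, 0.2)
print(round(t_max, 3), round(info_3pl(t_max, 1.2, 0.3, 0.2), 3))
print(round(info_variance(1.2, 0.3, 0.2), 4))
```

When c = 0 the maximum falls at theta = b and equals (D * a)^2 / 4. A nonzero c
shifts the maximum above b and lowers it.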

The effectively independent interval counts were statistically significant
predictors of the dependent variables, adding substantially to the amount of
variance in the data accounted for by the model over the variance accounted for
using only the item parameters as independent variables. This held true across
confidence levels, although the models using the 95% and 99% levels were
somewhat less effective than the other levels. The efficacy of the regression
model was improved slightly by the removal of an influential observation from the
data. The item has an unusually large c parameter value, but since the item
parameters were drawn from test data, it is not unrealistic. Results were
reported both with and without the use of this observation. The item information
maximum value and the item information sample variance proved to be poor
predictor variables for either dependent variable, adding little to the multiple
regression model using only the item parameters.

Some changes to the procedures used in this study may alter the results.
It is possible that a different method of measuring the item information
variance would result in a more useful predictor variable. The rules used to
include the remaining range of theta values in the interval count when the end
interval's endpoint failed to meet the non-overlapping criterion may require
refinement. In several instances, the end intervals as counted by the rules used
in this study had a value counted toward the total that was greater than 1.0.
This may have given these end intervals too much impact on the total interval
counts, and thus affected the results of the multiple regression. As the number
of effectively independent intervals is based on the test information, the intervals
near the maximum of the test information are likely to be similar for all items on
the test. The largest differences in effectively independent intervals are
observed near the extreme values of the range, and thus the method used to
determine the weight of those intervals is of importance.

Discussion 

The  results  of  the  data  analysis  in  Phase  1  of  this  study  confirm  results 
from  previous  studies.  The  empirical  data  distributions  do  indeed  seem  to  be 
from the chi-square family, as was believed. The data compression and greatly
reduced degrees of freedom observed in the data are also in agreement with
previous results. These results have been observed under other simulation
conditions, including those done using 1PL models. The assumption that the fit
index  value  is  drawn  from  a  member  of  the  chi-square  family  of  distributions  has 
been  supported,  but  the  assumption  that  the  distribution  is  well  fit  by  a 
theoretical  chi-square  distribution  with  degrees  of  freedom  determined  using  the 
traditional  algorithm  has  not.  There  is  continued  support  for  a  consensus  as  to 
the  effects  of  using  the  pseudocounts  method  on  the  fit  index  values,  but  the 
underlying  causes  of  the  problem  are  not  clear. 

The results of Phase 2 indicate that it is possible to predict, with limited
success, the scaling factor and effective degrees of freedom using the item
parameters and effectively independent interval counts. However, the
independent variables leave a substantial portion of the variance observed in
the dependent variables unexplained. The relationship of the
effectively independent interval counts with the dependent variables is strong
enough to warrant further investigation, but it does not provide sufficient
information to allay concerns about the dependability of the fit index values.

Though  the  ability  to  accurately  and  effectively  predict  these  values  can 
provide  a  means  of  increasing  the  utility  of  the  fit  measure,  until  the  causes  of 
the  phenomena  are  well-understood,  complete  confidence  in  the  detection  of 
item  misfit  on  the  part  of  users  is  unlikely.  The  use  of  a  theoretical  model 
incorporating  pseudocounts,  while  intuitively  sound,  has  had  statistical  effects  on 
the  item  fit  index  that  are  undesirable.  The  problems  observed  in  the  fit 
measure,  such  as  the  reduction  in  the  degrees  of  freedom  and  the  compressed 
data  distribution,  may  be  only  an  indication  of  more  far-reaching  and  subtler 
implications  of  using  this  method.  It  was  believed  that  the  use  of  the 
pseudocounts was the primary source of the unexpected behavior of the fit
statistic  distribution,  but  the  results  of  this  study  do  not  provide  strong  evidence 
that  this  is  the  case.  Item  fit  statistics  not  incorporating  the  use  of  pseudocounts 
have  also  proved  ineffective  in  detecting  item  misfit,  suggesting  that  studies 
assessing  the  distributional  properties  of  other  fit  measures  may  increase 
understanding  of  the  problem. 

The  pseudocounts  method  seems  to  provide  a  more  realistic  depiction  of 
examinee  ability  estimation,  as  it  incorporates  errors  of  estimation  reasonably 
expected  to  be  present.  Indeed,  a  frequently-cited  advantage  of  the  IRT  model 
is the ability to measure the degree of error in ability estimates across the ability
scale, rather than assume that the measurement error is constant. This
advantage  has  not  been  utilized  in  other  measures  of  model-data  fit.  But 
incorporation  of  the  information  about  errors  of  estimation  has  come  with  a  price. 
Known  violations  of  assumptions  of  the  chi-square  test  caused  researchers  to 
examine  data  properties  for  the  item  fit  measure,  with  surprising  results.  It  is 
possible  that  the  unexpected  alterations  in  the  properties  of  the  fit  measure  are 
indicative  of  other  changes  in  the  data  resulting  from  the  use  of  this  model. 

Directions  for  Future  Research 

Several lines of research related to this study will be pursued. A set of
predictor variables that more effectively model the dependent variables will be
sought. This search will incorporate both modifications to the construction of the
independent variables used in this study and new independent variables. As the
NAEP program does not provide individual scores, the assessments are
designed to effectively estimate the performance of groups. This approach
influences the items chosen for the assessment and the design of the test, and
it is possible that data from other large-scale testing programs would provide
different results or new insight into the problem of determining model-data fit.

The effectiveness of the multiple regression models contained herein will
be assessed further. Tests of other lengths and item parameter properties also
will be assessed to determine if the properties of the multiple regression hold
under other conditions and on other populations. An assumption was made for
this study that ability is normally distributed in the population. The robustness of
the methodology to violations of normality is unknown and must be assessed.

The  items  comprising  the  two  tests  cover  a  large  range  of  difficulty, 
discrimination,  and  pseudo-guessing  parameter  values.  However,  the  two  tests 
are  similar  in  overall  difficulty  and  provide  maximum  information  about 
examinees  at  similar  points  on  the  ability  scale.  Tests  and  items  that  provide 
maximum  information  at  other  portions  of  the  theta  scale  may  behave  in  very 
different  ways,  and  the  methodology  described  here  has  not  yet  been  examined 
in  these  contexts.  Data  derived  from  examinations  designed  to  provide 
maximum  information  near  cutoff  points,  such  as  those  used  in  professional 
licensure  and  certification,  should  be  examined.  In  addition,  only  true  parameter 
values  were  used  in  this  study.  The  effect  of  parameter  estimation  and  the 
resulting  errors  of  estimation  on  the  technique  is  unknown.  Studies 
incorporating  item  parameter  estimates  into  the  regression  models,  in  place  of 
the  true  item  parameter  values  used  herein,  will  help  to  determine  the  effects  of 
inclusion  of  errors  of  estimation  on  the  regression. 

The  standard  error  of  measurement  of  the  test  was  used  to  calculate  the 
width  of  the  effectively  independent  intervals  in  this  study.  A  more  Bayesian 
approach  to  determining  interval  width  might  also  prove  to  be  of  interest. 
Instead  of  using  the  test  standard  error  values,  the  standard  error  of  the  estimate 
of examinee ability at various points on the scale could be used to develop
interval widths. The precision of measurement of theta differs across the scale,
and  the  variable  may  prove  a  valuable  predictor. 

All  relationships  tested  in  this  study  were  assumed  to  be  linear  in  form. 
High  linear  correlation  values  are  not  conclusive  evidence  about  the  form  of  the 
relationship,  so  other  forms  of  relationship  between  variables  must  be  assessed 
before  linearity  is  accepted.  It  is  possible  that  some  or  all  of  the  relationships 
are  actually  non-linear,  so  models  utilizing  non-linear  relationships  should  be 
assessed.  It  is  also  possible  that  multivariate  modeling,  such  as  structural 
equation  modeling,  would  prove  more  effective  in  describing  the  actual 
relationships  between  the  variables,  as  only  univariate  relationships  were 
considered  in  this  study.  Also,  only  dichotomously  scored  items  were  used  in 
this  study.  The  result  of  using  these  methods  on  graded  response  and  partial- 
credit  response  data  is  unknown. 

The  results  of  this  study  offer  some  support  for  further  study  of  the  use  of 
effectively  independent  intervals  as  a  predictor  of  the  scaling  factor  and  effective 
degrees  of  freedom  for  the  item  fit  statistic  considered  here.  However,  additional 
or  improved  predictor  variables  will  be  necessary  to  account  for  the  remaining 
variance  in  the  data  before  truly  effective  prediction  of  these  variables  becomes 
possible.  Routine  use  of  item  fit  statistics  to  determine  the  need  for  possible 
revision  or  removal  of  items  from  pools  will  only  occur  when  practitioners  have 
confidence  in  the  reliability  and  validity  of  significant  misfit  as  indicated  by  the 
test.  Until  that  time,  decisions  must  continue  to  be  made  on  other  bases. 


APPENDIX 

CORRELATION  MATRIX  OF  DEPENDENT  AND  INDEPENDENT  VARIABLES 


Pearson Correlation Coefficients
Prob > |R| under Ho: Rho=0
N = 90

Each cell gives the correlation coefficient, with its probability in parentheses.

Variable     Scaling Factor        Effective df

50%INTS       0.53591 (0.0001)      0.61410 (0.0001)
55%INTS       0.53962 (0.0001)      0.62382 (0.0001)
60%INTS       0.53007 (0.0001)      0.61983 (0.0001)
65%INTS       0.53468 (0.0001)      0.60537 (0.0001)
70%INTS       0.54312 (0.0001)      0.60320 (0.0001)
75%INTS       0.52901 (0.0001)      0.61088 (0.0001)
80%INTS       0.53292 (0.0001)      0.61215 (0.0001)
85%INTS       0.49334 (0.0001)      0.56716 (0.0001)
90%INTS       0.44783 (0.0001)      0.60271 (0.0001)
95%INTS       0.47499 (0.0001)      0.51101 (0.0001)
99%INTS       0.39670 (0.0001)      0.50764 (0.0001)
INFOVAR      -0.07135 (0.5039)     -0.18062 (0.0885)
INFOMAX      -0.06986 (0.5129)     -0.25346 (0.0159)
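
The probabilities reported in this appendix (Prob > |R| under Ho: Rho=0) appear consistent with the usual t test for a Pearson correlation, t = r * sqrt((N - 2) / (1 - r^2)) with N - 2 degrees of freedom. The sketch below is an illustration under that assumption, not the procedure documented in the study. The function names (t_density, pearson_p_value) and the Simpson's-rule integration are choices made here for a self-contained example. It recomputes the two-sided probability for two of the tabled coefficients.

```python
import math


def t_density(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def pearson_p_value(r, n, steps=2000):
    """Two-sided p-value for H0: rho = 0 (assumes |r| < 1 and even steps)."""
    df = n - 2
    t = abs(r) * math.sqrt(df / (1.0 - r * r))
    # Composite Simpson's rule for the central area P(0 < T < |t|).
    h = t / steps
    total = t_density(0.0, df) + t_density(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    central = total * h / 3
    # Both tails together; clamp tiny negative values from rounding.
    return max(0.0, 1.0 - 2.0 * central)


# Two coefficients from the appendix, with N = 90.
for r in (-0.18062, -0.25346):
    print(r, round(pearson_p_value(r, 90), 4))
```

Under this assumption the recomputed values should land close to the tabled probabilities of 0.0885 and 0.0159. Entries printed as 0.0001 presumably reflect a lower reporting limit rather than an exact probability.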


REFERENCES 


Allen, N., Kline, R., & Zelenak, D. (1996). The NAEP 1994 technical
report. Washington, DC: National Center for Educational Statistics.

Allen,  N.,  Swinton,  S.,  Isham,  S.,  &  Zelenak,  D.  (1998).  The  NAEP  1996 
state  assessment  program  in  science.  Washington,  DC:  National  Center  for 
Educational  Statistics. 

Bishop,  Y.  M.  M.,  Fienberg,  S.  E.,  &  Holland,  P.  W.  (1975).  Discrete 
multivariate  analysis.  Cambridge,  MA:  The  MIT  Press. 

Bock, R. D. (1972). Estimating item parameters and latent ability when
responses are scored in two or more nominal categories. Psychometrika, 37,
29-51.

Bock, R. D. (1997). A brief history of item response theory. Educational
Measurement: Issues and Practice, 16(4), 21-33.

Bock,  R.  D.,  &  Jones,  L.  V.  (1968).  The  measurement  and  prediction  of 
judgment  and  choice.  San  Francisco,  CA:  Holden-Day. 

Bock, R. D., & Lieberman, M. (1970). Fitting a response model for n
dichotomously scored items. Psychometrika, 35, 179-197.

Chambers, J. M., Cleveland, W. S., Kleiner, B., & Tukey, P. A. (1983).
Graphical methods for data analysis. Belmont, CA: Wadsworth International
Group.

Cochran, W. G. (1952). The χ² test of goodness of fit. Annals of
Mathematical Statistics, 23, 315-346.

Crocker, L., & Algina, J. (1986). Introduction to classical and modern test
theory. Fort Worth, TX: Harcourt Brace Jovanovich College Publishers.

Divgi,  D.  R.  (1981).  Model  free  evaluation  of  equating  and  scaling. 
Applied  Psychological  Measurement,  5,  203-208. 




Donoghue, J. R. (1998). An investigation of the sampling distribution of a
likelihood ratio r2 measure of IRT drift / DIF. Paper presented at the 1998
Annual Meeting of the National Council on Measurement in Education, San
Diego.

Educational Testing Service. (n.d.). Description of PARSCALE item-level
goodness of fit indices. [Brochure]. Princeton, NJ: Author.

Feingold, M. (1992). The equivalence of Cohen's kappa and Pearson's
chi-square statistics in the 2 x 2 table. Educational and Psychological
Measurement, 52, 57-61.

Isham, S. P., & Donoghue, J. R. (1995, April). An investigation of the
sampling distributions of measures of IRT drift / DIF. Paper presented at the
1995 Annual Meeting of the National Council on Measurement in Education, San
Francisco.

Hambleton, R. K., & Swaminathan, H. (1985). Item response theory.
Boston: Kluwer.

Kotz,  S.,  Johnson,  N.,  &  Read,  C.  (Eds.).  (1985).  Encyclopedia  of 
statistical  sciences  (Vol.  5).  New  York:  Wiley  &  Sons. 

Lindgren,  B.  W.  (1976).  Statistical  theory.  (3rd  ed.).  New  York: 
MacMillan. 

McKinley, R. L., & Mills, C. N. (1985). A comparison of several
goodness-of-fit statistics. Applied Psychological Measurement, 9, 49-57.

Mislevy,  R.  J.,  &  Bock,  R.  D.  (1990).  BILOG  3:  Item  analysis  and  test 
scoring  with  binary  logistic  models  (2nd  ed.)  [computer  software  manual]. 
Mooresville,  IN:  Scientific  Software,  Inc. 

Mote, V. L., & Anderson, R. L. (1965). An investigation of the effect of
misclassification on the properties of χ²-tests in the analysis of categorical data.
Biometrika, 52, 95-109.

Muraki,  E.,  &  Bock,  R.  D.  (1996).  PARSCALE:  IRT  based  test  scoring 
and  item  analysis  for  graded  open-ended  exercises  and  performance  tasks 
[software  manual].  Chicago,  IL:  Scientific  Software,  Inc. 

Orlando, M., & Thissen, D. (1997). New item fit indices for dichotomous
item response theory models. Unpublished manuscript.




Raju,  N.  S.,  Bilgic,  R.,  Edwards,  J.  E.,  &  Fleer,  P.  F.  (1997).  Methodology 
review:  Estimation  of  population  validity  and  cross-validity,  and  the  use  of  equal 
weights  in  prediction.  Applied  Psychological  Measurement,  21,  291-305. 

Reise, S. P. (1990). A comparison of item- and person-fit methods of
assessing model-data fit in IRT. Applied Psychological Measurement, 14,
127-137.

Rogers, H. J., & Hattie, J. A. (1987). A Monte Carlo investigation of
several person and item fit statistics for item response models. Applied
Psychological Measurement, 11, 47-57.

Rost,  J.,  &  von  Davier,  M.  (1994).  A  conditional  item-fit  index  for  Rasch 
models.  Applied  Psychological  Measurement,  18,  171-182. 

Shavelson,  R.  J.  (1988).  Statistical  reasoning  for  the  social  sciences. 
(2nd  ed.).  Boston:  Allyn  and  Bacon,  Inc. 

Stone, C. A., Ankenmann, R. D., Lane, S., & Liu, M. (1993, April).
Scaling QUASAR's performance assessment. Paper presented at the 1993
Annual Meeting of the American Educational Research Association, Atlanta.

Stone, C. A., Mislevy, R. J., & Mazzeo, J. (1994, April). Classification
error and goodness-of-fit in IRT models. Paper presented at the 1994 Annual
Meeting of the American Educational Research Association, New Orleans.

Thissen,  D.  (1991).  MULTILOG  User's  Guide:  Multiple,  categorical  item 
analysis  and  test  scoring  using  item  response  theory  (Version  6.0)  [software 
manual].  Chicago,  IL:  Scientific  Software,  Inc. 

van der Linden, W. J., & Hambleton, R. K. (Eds.). (1997). Handbook of
modern item response theory. New York: Springer.

Wright,  B.  D.,  &  Panchapakesan,  N.  (1969).  A  procedure  for  sample-free 
item  analysis.  Educational  and  Psychological  Measurement,  29,  23-48. 

Yen, W. M. (1981). Using simulation to choose a latent trait model.
Applied Psychological Measurement, 5, 245-262.

Yen,  W.  M.  (1984).  Effects  of  local  item  dependence  on  the  fit  and 
equating  performance  of  the  three-parameter  logistic  model.  Applied 
Psychological  Measurement,  8, 125-145. 


BIOGRAPHICAL  SKETCH 

Catherine McClellan Hombo was born in Gainesville, Florida, but moved to north
Alabama  at  the  age  of  two  weeks.  She  lived  there,  attending  Brooks  Elementary  and 
High  Schools,  until  graduation  from  high  school.  At  that  point,  she  returned  to 
Gainesville,  attending  the  University  of  Florida  as  an  undergraduate,  majoring  in 
mathematics.  After  receiving  her  Bachelor  of  Science  degree,  she  remained  at  the 
University of Florida to complete a Master of Education degree in mathematics
education as part of the PROTEACH Secondary Education program.

Catherine  worked  as  a  mathematics  teacher  at  Mainland  High  School  in  Daytona 
Beach,  Florida  for  two  years.  She  then  accepted  a  position  as  a  math  coordinator  at 
Embry-Riddle  Aeronautical  University,  as  part  of  a  Title  III  grant.  The  position  required 
supervision  of  all  support  services  for  the  Department  of  Mathematics  and  teaching 
courses  in  mathematics  and  the  Freshman  Success  Program.  After  four  years  at 
Embry-Riddle,  she  returned  to  the  University  of  Florida  to  begin  a  program  of  study 
leading  to  a  Ph.D.  in  Research  and  Evaluation  Methodology  in  Education.  Upon 
completing  her  degree,  Catherine  will  begin  work  as  an  Associate  Research  Scientist  at 
Educational  Testing  Service  in  Princeton,  New  Jersey. 




I  certify  that  I  have  read  this  study  and  that  in  my  opinion  it  conforms  to 
acceptable  standards  of  scholarly  presentation  and  is  fully  adequate,  in  scope 
and  quality,  as  a  dissertation  for  the  degree  of  Doctor  of  Philosophy. 


Linda  Crocker,  Chair 
Professor  of  Foundations  of 
Education 


I  certify  that  I  have  read  this  study  and  that  in  my  opinion  it  conforms  to 
acceptable  standards  of  scholarly  presentation  and  is  fully  adequate,  in  scope 
and  quality,  as  a  dissertation  for  the  degree  of  Doctor  of  Philosophy. 


James Algina
Professor of Foundations of
Education


I  certify  that  I  have  read  this  study  and  that  in  my  opinion  it  conforms  to 
acceptable  standards  of  scholarly  presentation  and  is  fully  adequate,  in  scope 
and  quality,  as  a  dissertation  for  the  degree  of  Doctor  of  Philosophy. 


M. David Miller
Professor of Foundations of
Education


I  certify  that  I  have  read  this  study  and  that  in  my  opinion  it  conforms  to 
acceptable  standards  of  scholarly  presentation  and  is  fully  adequate,  in  scope 
and  quality,  as  a  dissertation  for  the  degree  of  Doctor  of  Philosophy. 


Thomasenia  Lott  Adams 
Associate  Professor  of 
Instruction  and  Curriculum 


I  certify  that  I  have  read  this  study  and  that  in  my  opinion  it  conforms  to 
acceptable  standards  of  scholarly  presentation  and  is  fully  adequate,  in  scope 
and  quality,  as  a  dissertation  for  the  degree  of  Doctor  of  Philosophy. 


Jean Larson
Professor of Mathematics


This  dissertation  was  submitted  to  the  Graduate  Faculty  of  the  College  of 
Education  and  the  Graduate  School  and  was  accepted  as  partial  fulfillment  of 
the  requirements  for  the  degree  of  Doctor  of  Philosophy. 


December,  1998 


Chairman,  Foundations  of 
Education 


Dean,  Graduate  School 


